Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Started by Amit Kapila over 13 years ago, 19 messages

amit.kapila@huawei.com

over 13 years ago

1 attachment(s)

From: Heikki Linnakangas [mailto:heikki(dot)linnakangas(at)enterprisedb(dot)com]
Sent: Monday, August 27, 2012 5:58 PM
To: Amit kapila
On 27.08.2012 15:18, Amit kapila wrote:

I have implemented the WAL Reduction Patch for the case of HOT Update as
pointed out by Simon and Robert. This patch uses the optimized WAL format
only for HOT updates; the other restrictions are the same as in the
previous patch.

The performance numbers for this patch are attached in this mail. It has
improved by 90% if the page has fillfactor 80.

Now going forward I have the following options:
a. Upload the patch in Open CF for WAL Reduction which contains
   reduction for HOT and non-HOT updates.
b. Upload the patch in Open CF for WAL Reduction which contains
   reduction for HOT updates.
c. Upload both the patches as different versions.

Let's do it for HOT updates only. Simon & Robert made good arguments on
why this is a bad idea for non-HOT updates.

Okay, I shall do it that way.
So now I shall send information about all the testing I have done for this
Patch and then Upload it in CF.

Rebased version of patch based on latest code.

With Regards,

Amit Kapila.

Attachments:

wal_update_changes_v2.patch (application/octet-stream)

*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 618,1045 ----
  }
  
  /*
+  * get_tuple_info - Gets the tuple offset and value.
+  *
+  * calculates the attribute value and offset, where the attribute ends in the
+  * tuple based on the attribute number and previous fetched attribute info.
+  *
+  * offset (I/P and O/P variable) - Input as end of previous attribute offset
+  *		and in case it is the first attribute then its value is zero.
+  *		Output as end of the current attribute in the tuple.
+  * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used or not.
+  */
+ static void
+ get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
+ 			   bool hasnulls, int attnum, Datum *value, uint16 *offset,
+ 			   bool *usecacheoff)
+ {
+ 	Form_pg_attribute thisatt = att[attnum];
+ 	uint16		off = *offset;
+ 	bool		slow = *usecacheoff;
+ 	char	   *tp;
+ 	HeapTupleHeader tup = tuple->t_data;
+ 
+ 	tp = (char *) tup + tup->t_hoff;
+ 
+ 	if (hasnulls && att_isnull(attnum, bp))
+ 	{
+ 		slow = true;			/* can't use attcacheoff anymore */
+ 		*offset = off;
+ 		*usecacheoff = slow;
+ 		return;
+ 	}
+ 
+ 	if (!slow && thisatt->attcacheoff >= 0)
+ 		off = thisatt->attcacheoff;
+ 	else if (thisatt->attlen == -1)
+ 	{
+ 		/*
+ 		 * We can only cache the offset for a varlena attribute if the offset
+ 		 * is already suitably aligned, so that there would be no pad bytes in
+ 		 * any case: then the offset will be valid for either an aligned or
+ 		 * unaligned value.
+ 		 */
+ 		if (!slow &&
+ 			off == att_align_nominal(off, thisatt->attalign))
+ 			thisatt->attcacheoff = off;
+ 		else
+ 		{
+ 			off = att_align_pointer(off, thisatt->attalign, -1,
+ 									tp + off);
+ 			slow = true;
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* not varlena, so safe to use att_align_nominal */
+ 		off = att_align_nominal(off, thisatt->attalign);
+ 
+ 		if (!slow)
+ 			thisatt->attcacheoff = off;
+ 	}
+ 
+ 	*value = fetchatt(thisatt, tp + off);
+ 
+ 	off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ 
+ 	if (thisatt->attlen <= 0)
+ 		slow = true;			/* can't use attcacheoff anymore */
+ 
+ 	*offset = off;
+ 	*usecacheoff = slow;
+ }
+ 
+ 
+ /*
+  * encode_xlog_update
+  *		Forms a diff tuple from old and new tuple with the modified columns.
+  *
+  *		att - attribute list.
+  *		oldtup - pointer to the old tuple.
+  *		heaptup - pointer to the modified tuple.
+  *		wal_tup - pointer to the wal record which needs to be formed from old
+  *				  and new tuples by using the modified columns list.
+  *		modifiedCols - modified columns list by the update command.
+  */
+ void
+ encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
+ 				   HeapTuple heaptup, HeapTuple wal_tup,
+ 				   Bitmapset *modifiedCols)
+ {
+ 	int			numberOfAttributes;
+ 	uint16		cur_offset = 0,
+ 				prev_offset = 0,
+ 				offset = 0;
+ 	int			attnum;
+ 	HeapTupleHeader newtuphdr = heaptup->t_data;
+ 	bits8	   *new_bp = newtuphdr->t_bits,
+ 			   *old_bp = oldtup->t_data->t_bits;
+ 	bool		old_hasnulls = HeapTupleHasNulls(oldtup);
+ 	bool		new_hasnulls = HeapTupleHasNulls(heaptup);
+ 	bool		cur_usecacheoff = false,
+ 				prev_usecacheoff = false;
+ 	Datum		cur_value,
+ 				prev_value;
+ 	uint16		data_length;
+ 	bool		check_for_padding = false;
+ 	char	   *data;
+ 	uint16		wal_offset = 0;
+ 
+ 	numberOfAttributes = HeapTupleHeaderGetNatts(newtuphdr);
+ 
+ 	data = (char *) wal_tup->t_data;
+ 	wal_offset = newtuphdr->t_hoff;
+ 
+ 	/* Copy the tuple header to the WAL tuple */
+ 	memcpy(data, heaptup->t_data, wal_offset);
+ 
+ 	for (attnum = 0; attnum < numberOfAttributes; attnum++)
+ 	{
+ 		/*
+ 		 * If the attribute is modified by the update operation, store the
+ 		 * appropriate offsets in the WAL record, otherwise skip to the next
+ 		 * attribute.
+ 		 */
+ 		if (bms_is_member((attnum + 1) - FirstLowInvalidHeapAttributeNumber,
+ 						  modifiedCols))
+ 		{
+ 			check_for_padding = true;
+ 
+ 			/*
+ 			 * calculate the offset where the modified attribute starts in the
+ 			 * old tuple used to store in the WAL record, this will be used to
+ 			 * traverse the old tuple during recovery.
+ 			 */
+ 			if (prev_offset)
+ 			{
+ 				*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_COPY;
+ 				wal_offset += sizeof(uint8);
+ 
+ 				wal_offset = SHORTALIGN(wal_offset);
+ 
+ 				*(uint16 *) (data + wal_offset) = prev_offset;
+ 				wal_offset += sizeof(uint16);
+ 			}
+ 
+ 			/* calculate the old tuple field length which needs to be ignored */
+ 			offset = prev_offset;
+ 			get_tuple_info(att, oldtup, old_bp, old_hasnulls, attnum,
+ 						   &prev_value, &prev_offset, &prev_usecacheoff);
+ 
+ 			data_length = prev_offset - offset;
+ 
+ 			if (data_length)
+ 			{
+ 				*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_IGN;
+ 				wal_offset += sizeof(uint8);
+ 
+ 				wal_offset = SHORTALIGN(wal_offset);
+ 
+ 				*(uint16 *) (data + wal_offset) = data_length;
+ 				wal_offset += sizeof(uint16);
+ 			}
+ 
+ 			/*
+ 			 * calculate the new tuple field start position to check whether
+ 			 * any padding is required or not.
+ 			 */
+ 			offset = cur_offset;
+ 			cur_offset = att_align_pointer(cur_offset,
+ 								  att[attnum]->attalign, att[attnum]->attlen,
+ 						(char *) newtuphdr + newtuphdr->t_hoff + cur_offset);
+ 
+ 			data_length = cur_offset - offset;
+ 
+ 			/*
+ 			 * The above calculation identifies whether any alignment
+ 			 * padding is required. The padding command is added only in
+ 			 * case the data is not NULL, which is checked
+ 			 * below.
+ 			 */
+ 
+ 			offset = cur_offset;
+ 			get_tuple_info(att, heaptup, new_bp, new_hasnulls, attnum,
+ 						   &cur_value, &cur_offset, &cur_usecacheoff);
+ 
+ 			/* if the new tuple data is null then nothing is required to add */
+ 			if (new_hasnulls && att_isnull(attnum, new_bp))
+ 			{
+ 				continue;
+ 			}
+ 
+ 			/* Add the padding if required as the data is not NULL */
+ 			if (data_length)
+ 			{
+ 				*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_PAD;
+ 				wal_offset += sizeof(uint8);
+ 
+ 				*(uint8 *) (data + wal_offset) = data_length;
+ 				wal_offset += sizeof(uint8);
+ 			}
+ 
+ 			/* get the attribute value and end offset for same */
+ 			*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_ADD;
+ 			wal_offset += sizeof(uint8);
+ 
+ 			wal_offset = SHORTALIGN(wal_offset);
+ 
+ 			data_length = cur_offset - offset;
+ 			*(uint16 *) (data + wal_offset) = data_length;
+ 			wal_offset += sizeof(uint16);
+ 
+ 			if (att[attnum]->attbyval)
+ 			{
+ 				/* pass-by-value */
+ 				char		tempdata[sizeof(Datum)];
+ 
+ 				/*
+ 				 * Here we are not storing the data as aligned in the WAL
+ 				 * record as we don't have the tuple descriptor while
+ 				 * replaying the xlog.
+ 				 *
+ 				 * But the alignment of the data is taken care of while
+ 				 * framing the tuple during heap_xlog_update.
+ 				 */
+ 				store_att_byval(tempdata,
+ 								cur_value,
+ 								att[attnum]->attlen);
+ 				memcpy((data + wal_offset), tempdata, att[attnum]->attlen);
+ 			}
+ 			else
+ 			{
+ 				memcpy((data + wal_offset),
+ 					   DatumGetPointer(cur_value),
+ 					   data_length);
+ 			}
+ 
+ 			wal_offset += data_length;
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * padding is required if the previous field is modified, so check
+ 			 * whether padding is required or not.
+ 			 *
+ 			 * The attnum is not modified so if the data in the old tuple is
+ 			 * NULL then in the new tuple also the field data is NULL.
+ 			 */
+ 			if (check_for_padding && !att_isnull(attnum, old_bp))
+ 			{
+ 				check_for_padding = false;
+ 
+ 				/*
+ 				 * calculate the old tuple field start position, required to
+ 				 * ignore if any alignment is present.
+ 				 */
+ 				offset = prev_offset;
+ 				prev_offset = att_align_pointer(prev_offset,
+ 								  att[attnum]->attalign, att[attnum]->attlen,
+ 												(char *) oldtup->t_data + oldtup->t_data->t_hoff + prev_offset);
+ 
+ 				data_length = prev_offset - offset;
+ 
+ 				if (data_length)
+ 				{
+ 					*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_IGN;
+ 					wal_offset += sizeof(uint8);
+ 
+ 					wal_offset = SHORTALIGN(wal_offset);
+ 
+ 					*(uint16 *) (data + wal_offset) = data_length;
+ 					wal_offset += sizeof(uint16);
+ 				}
+ 
+ 				/*
+ 				 * calculate the new tuple field start position to check
+ 				 * whether any padding is required or not because of field
+ 				 * alignment.
+ 				 */
+ 				offset = cur_offset;
+ 				cur_offset = att_align_pointer(cur_offset,
+ 								  att[attnum]->attalign, att[attnum]->attlen,
+ 						(char *) newtuphdr + newtuphdr->t_hoff + cur_offset);
+ 
+ 				data_length = cur_offset - offset;
+ 
+ 				if (data_length)
+ 				{
+ 					*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_PAD;
+ 					wal_offset += sizeof(uint8);
+ 
+ 					*(uint8 *) (data + wal_offset) = data_length;
+ 					wal_offset += sizeof(uint8);
+ 				}
+ 			}
+ 
+ 			get_tuple_info(att, oldtup, old_bp, old_hasnulls, attnum,
+ 						   &prev_value, &prev_offset, &prev_usecacheoff);
+ 
+ 			get_tuple_info(att, heaptup, new_bp, new_hasnulls, attnum,
+ 						   &cur_value, &cur_offset, &cur_usecacheoff);
+ 		}
+ 	}
+ 
+ 	wal_tup->t_len = wal_offset;
+ 	wal_tup->t_self = heaptup->t_self;
+ 	wal_tup->t_tableOid = heaptup->t_tableOid;
+ }
+ 
+ /*
+  * decode_xlog_update
+  *		deforms a diff tuple and forms the new tuple with the help of old tuple.
+  *
+  * The WAL record data is in the format as below
+  *
+  *	COPY + offset until copy required
+  *	IGN + length needs to be ignored from the old tuple.
+  *	PAD + length needs to be padded with zero in new tuple.
+  *	ADD + length of data + data which is modified.
+  *
+  * For the COPY command, copy the specified length from old tuple.
+  *
+  * Once the old tuple data copied, then increase the offset by the
+  * copied length.
+  *
+  * For the IGN command, ignore the specified length in the old tuple.
+  *
+  * For the PAD command, fill with zeros of the specified length in
+  * the new tuple.
+  *
+  * For the ADD command, copy the corresponding length of data from WAL
+  * record to the new tuple.
+  *
+  * Repeat this procedure until the WAL record reaches the end.
+  *
+  * Any remaining old tuple data is copied at the end.
+  *
+  *	htup - old tuple data pointer from which new tuple needs to be formed.
+  *	old_tup_len - old tuple length.
+  *	data - pointer to the new tuple which needs to be framed.
+  *	new_tup_len - output of new tuple data length.
+  *	waldata - wal record pointer from which the new tuple needs to be formed.
+  *	wal_len - wal record length.
+  */
+ void
+ decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
+ 				   uint32 *new_tup_len, char *waldata, uint32 wal_len)
+ {
+ 	uint8		command;
+ 	uint16		len = 0,
+ 				data_length,
+ 				prev_offset = 0,
+ 				cur_offset = 0;
+ 	char	   *olddata = (char *) htup + htup->t_hoff;
+ 
+ 	/*
+ 	 * Frame the new tuple from old tuple and WAL record
+ 	 */
+ 	len = 0;
+ 
+ 	/* Frame the new tuple from the old and WAL tuples */
+ 	while (len < wal_len)
+ 	{
+ 		command = *(uint8 *) (waldata + len);
+ 		len += sizeof(uint8);
+ 
+ 		switch (command)
+ 		{
+ 			case HEAP_UPDATE_WAL_OPT_COPY:
+ 				len = SHORTALIGN(len);
+ 				data_length = *(uint16 *) (waldata + len) - prev_offset;
+ 
+ 				/* Copy the old tuple data */
+ 				memcpy((data + cur_offset),
+ 					   (olddata + prev_offset),
+ 					   data_length);
+ 				cur_offset += data_length;
+ 				prev_offset += data_length;
+ 
+ 				len += sizeof(uint16);
+ 				break;
+ 			case HEAP_UPDATE_WAL_OPT_ADD:
+ 				len = SHORTALIGN(len);
+ 				data_length = *(uint16 *) (waldata + len);
+ 				len += sizeof(uint16);
+ 
+ 				/* Copy the modified attribute data from WAL record */
+ 				memcpy((data + cur_offset), (waldata + len), data_length);
+ 				cur_offset += data_length;
+ 				len += data_length;
+ 				break;
+ 			case HEAP_UPDATE_WAL_OPT_IGN:
+ 				len = SHORTALIGN(len);
+ 				data_length = *(uint16 *) (waldata + len);
+ 
+ 				/* Skip the oldtuple with data length in the WAL record */
+ 				prev_offset += data_length;
+ 				len += sizeof(uint16);
+ 				break;
+ 			case HEAP_UPDATE_WAL_OPT_PAD:
+ 				data_length = *(uint8 *) (waldata + len);
+ 				cur_offset += data_length;
+ 				len += sizeof(uint8);
+ 				break;
+ 			default:
+ 				Assert(0);
+ 				break;
+ 		}
+ 	}
+ 
+ 	/* Copy the remaining old tuple data to the new tuple */
+ 	if (prev_offset < old_tup_len)
+ 	{
+ 		memcpy((data + cur_offset),
+ 			   (olddata + prev_offset),
+ 			   (old_tup_len - prev_offset));
+ 		cur_offset += (old_tup_len - prev_offset);
+ 	}
+ 
+ 	*new_tup_len = cur_offset
+ 		+ (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
+ }
+ 
+ 
+ /*
   * heap_form_tuple
   *		construct a tuple from the given values[] and isnull[] arrays,
   *		which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 71,77 ****
  #include "utils/syscache.h"
  #include "utils/tqual.h"
  
- 
  /* GUC variable */
  bool		synchronize_seqscans = true;
  
--- 71,76 ----
***************
*** 85,91 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
  					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
  
--- 84,91 ----
  					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared,
! 				bool diff_update);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
  
***************
*** 2708,2713 **** simple_heap_delete(Relation relation, ItemPointer tid)
--- 2708,2714 ----
   *	cid - update command ID (used for visibility test, and stored into
   *		cmax/cmin if successful)
   *	crosscheck - if not InvalidSnapshot, also check old tuple against this
+  *	modifiedCols - the modified column list of the update command.
   *	wait - true if should wait for any conflicting update to commit/abort
   *
   * Normal, successful return value is HeapTupleMayBeUpdated, which
***************
*** 2729,2735 **** simple_heap_delete(Relation relation, ItemPointer tid)
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
--- 2730,2737 ----
  HTSU_Result
  heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, Bitmapset *modifiedCols,
! 			bool wait)
  {
  	HTSU_Result result;
  	TransactionId xid = GetCurrentTransactionId();
***************
*** 2737,2742 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2739,2745 ----
  	ItemId		lp;
  	HeapTupleData oldtup;
  	HeapTuple	heaptup;
+ 	HeapTupleData wal_tup;
  	Page		page;
  	BlockNumber block;
  	Buffer		buffer,
***************
*** 2752,2757 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2755,2765 ----
  	bool		use_hot_update = false;
  	bool		all_visible_cleared = false;
  	bool		all_visible_cleared_new = false;
+ 	struct
+ 	{
+ 		HeapTupleHeaderData hdr;
+ 		char		data[MaxHeapTupleSize];
+ 	}			tbuf;
  
  	Assert(ItemPointerIsValid(otid));
  
***************
*** 3195,3204 **** l2:
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 											 newbuf, heaptup,
! 											 all_visible_cleared,
! 											 all_visible_cleared_new);
  
  		if (newbuf != buffer)
  		{
--- 3203,3233 ----
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr;
! 
! 		/*
! 		 * Apply the xlog diff update algorithm only for hot updates.
! 		 */
! 		if (modifiedCols && use_hot_update)
! 		{
! 			wal_tup.t_data = (HeapTupleHeader) &tbuf;
! 			encode_xlog_update(relation->rd_att->attrs, &oldtup, heaptup,
! 							   &wal_tup, modifiedCols);
! 
! 			recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 									 newbuf, &wal_tup,
! 									 all_visible_cleared,
! 									 all_visible_cleared_new,
! 									 true);
! 		}
! 		else
! 		{
! 			recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 									 newbuf, heaptup,
! 									 all_visible_cleared,
! 									 all_visible_cleared_new,
! 									 false);
! 		}
  
  		if (newbuf != buffer)
  		{
***************
*** 3385,3390 **** simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
--- 3414,3420 ----
  	result = heap_update(relation, otid, tup,
  						 &update_ctid, &update_xmax,
  						 GetCurrentCommandId(true), InvalidSnapshot,
+ 						 NULL,
  						 true /* wait for commit */ );
  	switch (result)
  	{
***************
*** 4429,4435 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  				Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared)
  {
  	xl_heap_update xlrec;
  	xl_heap_header xlhdr;
--- 4459,4466 ----
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  				Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared,
! 				bool diff_update)
  {
  	xl_heap_update xlrec;
  	xl_heap_header xlhdr;
***************
*** 4448,4456 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	xlrec.all_visible_cleared = all_visible_cleared;
  	xlrec.newtid = newtup->t_self;
! 	xlrec.new_all_visible_cleared = new_all_visible_cleared;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
--- 4479,4493 ----
  
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	xlrec.diff_update = diff_update;
  	xlrec.newtid = newtup->t_self;
! 
! 	/*
! 	 * The upper 4 bits tell whether PD_ALL_VISIBLE was cleared on the new
! 	 * page; the lower 4 bits tell the same for the old page.
! 	 */
! 	xlrec.new_all_visible_cleared = all_visible_cleared;
! 	xlrec.new_all_visible_cleared |= new_all_visible_cleared << 4;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
***************
*** 5239,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	}			tbuf;
  	xl_heap_header xlhdr;
  	int			hsize;
! 	uint32		newlen;
  	Size		freespace;
  
  	/*
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5276,5293 ----
  	}			tbuf;
  	xl_heap_header xlhdr;
  	int			hsize;
! 	uint32		new_tup_len = 0;
  	Size		freespace;
  
+ 	/* Initialize the buffer, used to frame the new tuple */
+ 	MemSet((char *) &tbuf.hdr, 0, sizeof(HeapTupleHeaderData));
+ 	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
+ 
  	/*
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->new_all_visible_cleared & 0x0F)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5266,5272 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	}
  
  	/* Deal with old tuple version */
- 
  	buffer = XLogReadBuffer(xlrec->target.node,
  							ItemPointerGetBlockNumber(&(xlrec->target.tid)),
  							false);
--- 5307,5312 ----
***************
*** 5291,5296 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5331,5359 ----
  
  	htup = (HeapTupleHeader) PageGetItem(page, lp);
  
+ 	if (xlrec->diff_update)
+ 	{
+ 		char	   *data = (char *) &tbuf.hdr + htup->t_hoff;
+ 		uint32		old_tup_len;
+ 		uint32		wal_len;
+ 		char	   *waldata = (char *) xlrec + hsize + htup->t_hoff
+ 		- offsetof(HeapTupleHeaderData, t_bits);
+ 
+ 		wal_len = record->xl_len - hsize;
+ 		Assert(wal_len <= MaxHeapTupleSize);
+ 
+ 		wal_len -= (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
+ 
+ 		old_tup_len = ItemIdGetLength(lp) - htup->t_hoff;
+ 
+ 		/* copy exactly the tuple header present in the WAL to new tuple */
+ 		memcpy((char *) &tbuf.hdr + offsetof(HeapTupleHeaderData, t_bits),
+ 			   (char *) xlrec + hsize,
+ 			   (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits)));
+ 
+ 		decode_xlog_update(htup, old_tup_len, data, &new_tup_len, waldata, wal_len);
+ 	}
+ 
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
  						  HEAP_XMAX_IS_MULTI |
***************
*** 5308,5314 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->all_visible_cleared)
  		PageClearAllVisible(page);
  
  	/*
--- 5371,5377 ----
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->new_all_visible_cleared & 0x0F)
  		PageClearAllVisible(page);
  
  	/*
***************
*** 5317,5322 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5380,5386 ----
  	 */
  	if (samepage)
  		goto newsame;
+ 
  	PageSetLSN(page, lsn);
  	PageSetTLI(page, ThisTimeLineID);
  	MarkBufferDirty(buffer);
***************
*** 5330,5336 **** newt:;
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->new_all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5394,5400 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if ((xlrec->new_all_visible_cleared >> 4) & 0x0F)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5377,5396 **** newsame:;
  	if (PageGetMaxOffsetNumber(page) + 1 < offnum)
  		elog(PANIC, "heap_update_redo: invalid max offset number");
  
- 	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
- 
- 	newlen = record->xl_len - hsize;
- 	Assert(newlen <= MaxHeapTupleSize);
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
! 	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 	/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 	memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 		   (char *) xlrec + hsize,
! 		   newlen);
! 	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
  	htup->t_hoff = xlhdr.t_hoff;
--- 5441,5464 ----
  	if (PageGetMaxOffsetNumber(page) + 1 < offnum)
  		elog(PANIC, "heap_update_redo: invalid max offset number");
  
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
+ 
  	htup = &tbuf.hdr;
! 
! 	if (!xlrec->diff_update)
! 	{
! 		new_tup_len = record->xl_len - hsize;
! 		Assert(new_tup_len <= MaxHeapTupleSize);
! 
! 		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 			   (char *) xlrec + hsize,
! 			   new_tup_len);
! 	}
! 
! 	new_tup_len += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
  	htup->t_hoff = xlhdr.t_hoff;
***************
*** 5400,5406 **** newsame:;
  	/* Make sure there is no forward chain link in t_ctid */
  	htup->t_ctid = xlrec->newtid;
  
! 	offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
--- 5468,5474 ----
  	/* Make sure there is no forward chain link in t_ctid */
  	htup->t_ctid = xlrec->newtid;
  
! 	offnum = PageAddItem(page, (Item) htup, new_tup_len, offnum, true, true);
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***************
*** 49,54 ****
--- 49,55 ----
  #include "utils/memutils.h"
  #include "utils/rel.h"
  #include "utils/tqual.h"
+ #include "parser/parsetree.h"
  
  
  /*
***************
*** 479,490 **** ExecUpdate(ItemPointer tupleid,
--- 480,493 ----
  		   bool canSetTag)
  {
  	HeapTuple	tuple;
+ 	HeapTuple	tuple_bf_trigger;
  	ResultRelInfo *resultRelInfo;
  	Relation	resultRelationDesc;
  	HTSU_Result result;
  	ItemPointerData update_ctid;
  	TransactionId update_xmax;
  	List	   *recheckIndexes = NIL;
+ 	Bitmapset  *modifiedCols = NULL;
  
  	/*
  	 * abort the operation if not running transactions
***************
*** 496,502 **** ExecUpdate(ItemPointer tupleid,
  	 * get the heap tuple out of the tuple table slot, making sure we have a
  	 * writable copy
  	 */
! 	tuple = ExecMaterializeSlot(slot);
  
  	/*
  	 * get information on the (current) result relation
--- 499,505 ----
  	 * get the heap tuple out of the tuple table slot, making sure we have a
  	 * writable copy
  	 */
! 	tuple = tuple_bf_trigger = ExecMaterializeSlot(slot);
  
  	/*
  	 * get information on the (current) result relation
***************
*** 554,559 **** lreplace:;
--- 557,571 ----
  		if (resultRelationDesc->rd_att->constr)
  			ExecConstraints(resultRelInfo, slot, estate);
  
+ 		/* check whether the xlog diff update can be applied or not? */
+ 		if ((resultRelationDesc->rd_toastoid == InvalidOid)
+ 			&& (tuple_bf_trigger == tuple)
+ 			&& (tuple->t_len > MinHeapTupleSizeForDiffUpdate))
+ 		{
+ 			modifiedCols = (rt_fetch(resultRelInfo->ri_RangeTableIndex,
+ 									 estate->es_range_table)->modifiedCols);
+ 		}
+ 
  		/*
  		 * replace the heap tuple
  		 *
***************
*** 567,572 **** lreplace:;
--- 579,585 ----
  							 &update_ctid, &update_xmax,
  							 estate->es_output_cid,
  							 estate->es_crosscheck_snapshot,
+ 							 modifiedCols,
  							 true /* wait for commit */ );
  		switch (result)
  		{
***************
*** 597,602 **** lreplace:;
--- 610,623 ----
  						*tupleid = update_ctid;
  						slot = ExecFilterJunk(resultRelInfo->ri_junkFilter, epqslot);
  						tuple = ExecMaterializeSlot(slot);
+ 
+ 						/*
+ 						 * Incase of revalidation reinitialize the values
+ 						 * In case of revalidation, reinitialize the values
+ 						 */
+ 						tuple_bf_trigger = tuple;
+ 						modifiedCols = NULL;
+ 
  						goto lreplace;
  					}
  				}
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 105,111 **** extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
--- 105,112 ----
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid,
  			HeapTuple newtup,
  			ItemPointer ctid, TransactionId *update_xmax,
! 			CommandId cid, Snapshot crosscheck, Bitmapset  *modifiedCols,
! 			bool wait);
  extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
  				Buffer *buffer, ItemPointer ctid,
  				TransactionId *update_xmax, CommandId cid,
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,149 **** typedef struct xl_heap_update
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
! 	bool		new_all_visible_cleared;		/* same for the page of newtid */
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
--- 142,155 ----
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	bool	diff_update;		/* optimized update or not */
! 	/*
! 	 * To keep the structure size same all_visible_cleared is merged with
! 	 * new_all_visible_cleared.
! 	 */
! 	bool	new_all_visible_cleared; /* upper 4 bits: PD_ALL_VISIBLE was
! 										cleared on the new page; lower 4
! 										bits: same for the old page */
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 528,533 **** struct MinimalTupleData
--- 528,546 ----
  		HeapTupleHeaderSetOid((tuple)->t_data, (oid))
  
  
+ /* WAL Diff update options */
+ #define HEAP_UPDATE_WAL_OPT_COPY 0
+ #define HEAP_UPDATE_WAL_OPT_ADD  1
+ #define HEAP_UPDATE_WAL_OPT_IGN  2
+ #define HEAP_UPDATE_WAL_OPT_PAD  3
+ 
+ /*
+  * Minimum tuple length that the new tuple must exceed during an update
+  * operation for the WAL optimization to be applied.
+  */
+ #define MinHeapTupleSizeForDiffUpdate 128
+ 
+ 
  /* ----------------
   *		fastgetattr
   *
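
As a reading aid for the record format documented above decode_xlog_update, the replay loop can be sketched as a standalone C function. This is not the patch's code: apply_delta, the OP_* names and the unaligned host-order 16-bit operands are assumptions made for illustration, SHORTALIGN handling and the tuple header are left out, and PAD bytes are zeroed explicitly instead of relying on a pre-zeroed buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode values mirror HEAP_UPDATE_WAL_OPT_* in the patch; names are hypothetical. */
enum { OP_COPY = 0, OP_ADD = 1, OP_IGN = 2, OP_PAD = 3 };

/* Read a 16-bit operand in host byte order without alignment assumptions. */
static uint16_t get_u16(const unsigned char *p)
{
    uint16_t v;

    memcpy(&v, p, sizeof v);
    return v;
}

/*
 * Rebuild the new tuple body from the old body plus a delta record.
 * Returns the number of bytes written to out, or -1 if the delta is
 * malformed or out is too small.
 */
static long apply_delta(const unsigned char *old, size_t old_len,
                        const unsigned char *delta, size_t delta_len,
                        unsigned char *out, size_t out_cap)
{
    size_t pos = 0, old_off = 0, new_off = 0, tail;

    while (pos < delta_len)
    {
        unsigned char op = delta[pos++];
        size_t n;

        switch (op)
        {
            case OP_COPY:   /* operand: old-tuple offset to copy up to */
                if (delta_len - pos < 2) return -1;
                n = get_u16(delta + pos); pos += 2;
                if (n < old_off || n > old_len) return -1;
                n -= old_off;
                if (n > out_cap - new_off) return -1;
                memcpy(out + new_off, old + old_off, n);
                old_off += n; new_off += n;
                break;
            case OP_IGN:    /* operand: old-tuple bytes to skip */
                if (delta_len - pos < 2) return -1;
                n = get_u16(delta + pos); pos += 2;
                if (n > old_len - old_off) return -1;
                old_off += n;
                break;
            case OP_PAD:    /* operand: one byte, count of zero bytes */
                if (delta_len - pos < 1) return -1;
                n = delta[pos++];
                if (n > out_cap - new_off) return -1;
                memset(out + new_off, 0, n);
                new_off += n;
                break;
            case OP_ADD:    /* operand: length, followed by the new bytes */
                if (delta_len - pos < 2) return -1;
                n = get_u16(delta + pos); pos += 2;
                if (n > delta_len - pos || n > out_cap - new_off) return -1;
                memcpy(out + new_off, delta + pos, n);
                pos += n; new_off += n;
                break;
            default:
                return -1;
        }
    }

    /* Whatever is left of the old tuple is unchanged: copy it over. */
    tail = old_len - old_off;
    if (tail > out_cap - new_off) return -1;
    memcpy(out + new_off, old + old_off, tail);
    return (long) (new_off + tail);
}
```

Returning -1 on a malformed record, where the patch uses Assert(0), is a choice made for the sketch only; it keeps the example testable outside the backend.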

Heikki Linnakangas

hlinnakangas@vmware.com

over 13 years ago

In reply to: Amit kapila (#1)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 24.09.2012 13:57, Amit kapila wrote:

Rebased version of patch based on latest code.

When HOT was designed, we decided that heap_update needs to compare the
old and new attributes directly, with memcmp(), to determine whether any
of the indexed columns have changed. It was not deemed feasible to
pass down that information from the executor. I don't remember the
details of why that was, but you seem to be trying to do the same thing
in this patch, and pass the bitmap of modified cols from the executor to
heap_update(). I'm pretty sure that won't work, for the same reasons we
didn't do it for HOT.

I still feel that it would probably be better to use a generic delta
encoding scheme, instead of inventing one. How about VCDIFF
(http://tools.ietf.org/html/rfc3284), for example? Or you could reuse
the LZ compressor that we already have in the source tree. You can use
LZ for delta compression by initializing the history buffer of the
algorithm with the old tuple, and then compressing the new tuple as
usual. Or you could still use the knowledge of where the attributes
begin and end and which attributes were updated, and do the encoding
similar to how you did in the patch, but use LZ as the output format.
That way the decoding would be the same as LZ decompression.

- Heikki

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Heikki Linnakangas (#2)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Tuesday, September 25, 2012 7:30 PM Heikki Linnakangas wrote:
On 24.09.2012 13:57, Amit kapila wrote:

Rebased version of patch based on latest code.

When HOT was designed, we decided that heap_update needs to compare the
old and new attributes directly, with memcmp(), to determine whether any
of the indexed columns have changed. It was not deemed feasible to
pass down that information from the executor. I don't remember the
details of why that was, but you seem to be trying to do the same thing in
this patch, and pass the bitmap of modified cols from the executor to
heap_update(). I'm pretty sure that won't work, for the same reasons we
didn't do it for HOT.

I think the reason for not relying on the modified-columns list is that in
some cases it might not give the correct information. For example, BEFORE
triggers can change columns beyond the ones the command modified, which is
why the HOT check has to do the comparison. In our case we have handled this
by skipping the optimization in such a case, so we never rely on an
inaccurate modified-columns list.

If you feel the comparison is a must, we can do it in the same way as we
identify HOT updates?

I still feel that it would probably be better to use a generic delta
encoding scheme, instead of inventing one. How about VCDIFF
(http://tools.ietf.org/html/rfc3284), for example? Or you could reuse
the LZ compressor that we already have in the source tree. You can use
LZ for delta compression by initializing the history buffer of the
algorithm with the old tuple, and then compressing the new tuple as
usual.

Or you could still use the knowledge of where the attributes
begin and end and which attributes were updated, and do the encoding
similar to how you did in the patch, but use LZ as the output format.
That way the decoding would be the same as LZ decompression.

Can you please explain why you think applying LZ compression after the
encoding is better, given that we have already reduced the amount of WAL
for updates by storing only the changed column information?

a. is it to further reduce the size of WAL
b. storing diff WAL in some standard format
c. or does it give any other kind of benefit

With Regards,
Amit Kapila.

Noah Misch

noah@leadboat.com

over 13 years ago

In reply to: Amit kapila (#1)

Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:

Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are striking,
especially considering the isolation of the changes.

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective. Expansion in one column may instigate TOAST
compression of a logically-unmodified column, and that counts as a change for
xlog delta purposes. You do currently skip the optimization for relations
having a TOAST table, but TOAST compression can still apply. Observe this
with text columns of storage mode PLAIN. I see two ways out: skip the new
behavior when need_toast=true, or compare all inline column data, not just
what the executor modified. One can probably construct a benchmark favoring
either choice. I'd lean toward the latter; wide tuples are the kind this
change can most help. If the marginal advantage of ignoring known-unmodified
columns proves important, we can always bring it back after designing a way to
track which columns changed in the toaster.
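
The second option above, comparing all inline column data, might look roughly like the following standalone sketch. It assumes each column has already been located as a byte span inside the stored tuple; the ColSpan type and find_changed_columns() are hypothetical names for illustration and are not part of the patch or of heapam.c. NULL handling and alignment padding are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: where a column's stored bytes start, and their length. */
typedef struct ColSpan
{
	size_t		off;
	size_t		len;
} ColSpan;

/*
 * Mark every column whose stored bytes differ between the old and new tuple.
 * A column that was silently compressed by TOAST shows up as changed here,
 * even though the executor did not modify it. Returns the number of changes.
 */
int
find_changed_columns(const unsigned char *oldp, const ColSpan *oldcols,
					 const unsigned char *newp, const ColSpan *newcols,
					 int ncols, bool *changed)
{
	int			nchanged = 0;

	assert(ncols >= 0);
	for (int i = 0; i < ncols; i++)
	{
		/* Different length means different; otherwise compare the bytes. */
		changed[i] = oldcols[i].len != newcols[i].len ||
			memcmp(oldp + oldcols[i].off, newp + newcols[i].off,
				   oldcols[i].len) != 0;
		if (changed[i])
			nchanged++;
	}
	return nchanged;
}
```

The cost is one memcmp() per column while the buffer lock is held, which is the trade-off debated later in the thread.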

Given that, why not treat the tuple as an opaque series of bytes and not worry
about datum boundaries? When several narrow columns change together, say a
sequence of sixteen smallint columns, you will use fewer binary delta commands
by representing the change with a single 32-byte substitution. If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still be
able to save considerable space. That case arises in many forms: changing
one word in a long string, changing one element in a long array, changing one
field of a composite-typed column. Granted, this makes the choice of delta
encoding algorithm more important.
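
As a rough illustration of the opaque-bytes idea, the sketch below encodes the new tuple image as "shared prefix, changed middle, shared suffix" relative to the old image. This is a deliberately simple stand-in, not the encoding used by the patch and not VCDIFF; the ByteDelta type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical delta: bytes shared at both ends, plus the replaced middle. */
typedef struct ByteDelta
{
	size_t		prefix_len;		/* leading bytes identical in old and new */
	size_t		suffix_len;		/* trailing bytes identical in old and new */
	size_t		middle_len;		/* bytes of new data carried by the delta */
	const unsigned char *middle;	/* points into the new image */
} ByteDelta;

ByteDelta
byte_delta_encode(const unsigned char *oldp, size_t oldlen,
				  const unsigned char *newp, size_t newlen)
{
	ByteDelta	d;
	size_t		minlen = oldlen < newlen ? oldlen : newlen;
	size_t		prefix = 0;
	size_t		suffix = 0;

	/* Longest common prefix. */
	while (prefix < minlen && oldp[prefix] == newp[prefix])
		prefix++;
	/* Longest common suffix that does not overlap the prefix. */
	while (suffix < minlen - prefix &&
		   oldp[oldlen - 1 - suffix] == newp[newlen - 1 - suffix])
		suffix++;

	d.prefix_len = prefix;
	d.suffix_len = suffix;
	d.middle_len = newlen - prefix - suffix;
	d.middle = newp + prefix;
	return d;
}

/* Rebuild the new image from the old image and the delta; returns its length. */
size_t
byte_delta_decode(const ByteDelta *d, const unsigned char *oldp, size_t oldlen,
				  unsigned char *out)
{
	assert(d->prefix_len + d->suffix_len <= oldlen);
	memcpy(out, oldp, d->prefix_len);
	memcpy(out + d->prefix_len, d->middle, d->middle_len);
	memcpy(out + d->prefix_len + d->middle_len,
		   oldp + oldlen - d->suffix_len, d->suffix_len);
	return d->prefix_len + d->middle_len + d->suffix_len;
}
```

With this scheme, a run of adjacent narrow columns that change together becomes a single substitution, regardless of where the datum boundaries fall. It handles only one changed region, so a real delta algorithm would do better on scattered changes.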

Like Heikki, I'm left wondering why your custom delta encoding is preferable
to an encoding from the literature. Your encoding has much in common with
VCDIFF, even sharing two exact command names. If a custom encoding is the
right thing, code comments or a README section should at least discuss the
advantages over an established alternative. Idle thought: it might pay off to
use 1-byte sizes and offsets most of the time. Tuples shorter than 256 bytes
are common; for longer tuples, we can afford wider offsets.
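
The idle thought about 1-byte sizes and offsets could be sketched as below. The width rule and the helper names are assumptions made for illustration; the sketch also assumes tuples fit in a 16-bit length, which holds for the default 8 kB block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: one byte per offset when the tuple is shorter than 256 bytes. */
size_t
delta_offset_width(uint32_t tuple_len)
{
	return tuple_len < 256 ? 1 : 2;
}

/* Store value in 'width' bytes, low byte first; returns the bytes written. */
size_t
delta_put_offset(unsigned char *buf, uint32_t value, size_t width)
{
	assert(width == 1 || width == 2);
	assert(width == 2 ? value <= 0xFFFF : value <= 0xFF);
	buf[0] = (unsigned char) (value & 0xFF);
	if (width == 2)
		buf[1] = (unsigned char) ((value >> 8) & 0xFF);
	return width;
}

/* Read back a value written by delta_put_offset() with the same width. */
uint32_t
delta_get_offset(const unsigned char *buf, size_t width)
{
	uint32_t	value = buf[0];

	if (width == 2)
		value |= (uint32_t) buf[1] << 8;
	return value;
}
```

Because the decoder knows the old tuple's length, both sides can derive the width without storing it in the record.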

The benchmarks you posted upthread were helpful. I think benchmarking with
fsync=off is best if you don't have a battery-backed write controller or SSD.
Otherwise, fsync time dominates a pgbench run. Please benchmark recovery. To
do so, set up WAL archiving and take a base backup from a fresh cluster. Run
pgbench for a while. Finally, observe the elapsed time to recover your base
backup to the end of archived WAL.

*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c

+ /*
+  * encode_xlog_update
+  *		Forms a diff tuple from old and new tuple with the modified columns.
+  *
+  *		att - attribute list.
+  *		oldtup - pointer to the old tuple.
+  *		heaptup - pointer to the modified tuple.
+  *		wal_tup - pointer to the wal record which needs to be formed from old
+ 				  and new tuples by using the modified columns list.
+  *		modifiedCols - modified columns list by the update command.
+  */
+ void
+ encode_xlog_update(Form_pg_attribute *att, HeapTuple oldtup,
+ 				   HeapTuple heaptup, HeapTuple wal_tup,
+ 				   Bitmapset *modifiedCols)

This name is too generic for an extern function. Maybe "heap_delta_encode"?

+ void
+ decode_xlog_update(HeapTupleHeader htup, uint32 old_tup_len, char *data,
+ 				   uint32 *new_tup_len, char *waldata, uint32 wal_len)

Likewise, maybe "heap_delta_decode" here.

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 71,77 ****
#include "utils/syscache.h"
#include "utils/tqual.h"

-
/* GUC variable */
bool synchronize_seqscans = true;

Spurious whitespace change.

***************
*** 3195,3204 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3203,3233 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! 		XLogRecPtr	recptr;
! 
! 		/*
! 		 * Apply the xlog diff update algorithm only for hot updates.
! 		 */
! 		if (modifiedCols && use_hot_update)

Why HOT in particular? I understand the arguments upthread for skipping the
optimization when the update crosses pages, but the other condition for HOT
(no changes to indexed columns) seems irrelevant here. Why not retest "newbuf
== buffer", instead?

In any event, the comment should memorialize rationale behind any excluded
cases, not merely restate the obvious fact that the code excludes them.

For the record, I think that if this pays off for intra-page updates, we
should eventually extend it to cross-page updates under full_page_writes=on.
If we were already logging deltas for all updates, I doubt we would adopt a
proposal to add complete-tuple logging as a disaster recovery aid. When
something corrupts a block, all bets are off.

***************
*** 5239,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
} tbuf;
xl_heap_header xlhdr;
int hsize;
! uint32 newlen;
Size freespace;
/*
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! 	if (xlrec->all_visible_cleared)
{
Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5276,5293 ----
}			tbuf;
xl_heap_header xlhdr;
int			hsize;
! 	uint32		new_tup_len = 0;

This variable rename looks spurious.

Size freespace;

+ 	/* Initialize the buffer, used to frame the new tuple */
+ 	MemSet((char *) &tbuf.hdr, 0, sizeof(HeapTupleHeaderData));
+ 	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
+ 
/*
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! 	if (xlrec->new_all_visible_cleared & 0x0F)
{
Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5266,5272 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
}

/* Deal with old tuple version */
-
buffer = XLogReadBuffer(xlrec->target.node,
ItemPointerGetBlockNumber(&(xlrec->target.tid)),
false);

Spurious whitespace change.

--- 5307,5312 ----
***************
*** 5291,5296 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5331,5359 ----

htup = (HeapTupleHeader) PageGetItem(page, lp);

+ 	if (xlrec->diff_update)
+ 	{
+ 		char	   *data = (char *) &tbuf.hdr + htup->t_hoff;
+ 		uint32		old_tup_len;
+ 		uint32		wal_len;
+ 		char	   *waldata = (char *) xlrec + hsize + htup->t_hoff
+ 		- offsetof(HeapTupleHeaderData, t_bits);
+ 
+ 		wal_len = record->xl_len - hsize;
+ 		Assert(wal_len <= MaxHeapTupleSize);
+ 
+ 		wal_len -= (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits));
+ 
+ 		old_tup_len = ItemIdGetLength(lp) - htup->t_hoff;
+ 
+ 		/* copy exactly the tuple header present in the WAL to new tuple */
+ 		memcpy((char *) &tbuf.hdr + offsetof(HeapTupleHeaderData, t_bits),
+ 			   (char *) xlrec + hsize,
+ 			   (htup->t_hoff - offsetof(HeapTupleHeaderData, t_bits)));
+ 
+ 		decode_xlog_update(htup, old_tup_len, data, &new_tup_len, waldata, wal_len);

I think the above code should appear later, with treatment of the new tuple.

encode_xlog_update() and decode_xlog_update() should be essentially-inverse
APIs. Instead, you have encode_xlog_update() working with HeapTuple arguments
while decode_xlog_update() works with opaque pointers. encode_xlog_update()
clones the header, but decode_xlog_update() leaves that to its caller. Those
decisions are convenient enough for these heapam.c callers, but I think
heap_xlog_update() should work harder so the decode_xlog_update() argument
list need not appear ad hoc.

***************
*** 5317,5322 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5380,5386 ----
*/
if (samepage)
goto newsame;
+ 
PageSetLSN(page, lsn);
PageSetTLI(page, ThisTimeLineID);
MarkBufferDirty(buffer);

Spurious whitespace change.

*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c

***************
*** 554,559 **** lreplace:;
--- 557,571 ----
if (resultRelationDesc->rd_att->constr)
ExecConstraints(resultRelInfo, slot, estate);

+ 		/* check whether the xlog diff update can be applied or not? */
+ 		if ((resultRelationDesc->rd_toastoid == InvalidOid)
+ 			&& (tuple_bf_trigger == tuple)
+ 			&& (tuple->t_len > MinHeapTupleSizeForDiffUpdate))

Having the executor apply these tests introduces a modularity violation.

If any of these restrictions are to remain, a comment at code enforcing them
should give rationale for each.

*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,149 **** typedef struct xl_heap_update
{
xl_heaptid	target;			/* deleted tuple id */
ItemPointerData newtid;		/* new inserted tuple id */
! 	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
! 	bool		new_all_visible_cleared;		/* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;

--- 142,155 ----
{
xl_heaptid	target;			/* deleted tuple id */
ItemPointerData newtid;		/* new inserted tuple id */
! 	bool	diff_update;		/* optimized update or not */
! 	/*
! 	 * To keep the structure size same all_visible_cleared is merged with
! 	 * new_all_visible_cleared.
! 	 */
! 	bool	new_all_visible_cleared; /* MSB 4 bits tells PD_ALL_VISIBLE	was
! 										cleared of new page and rest 4 bits
! 										for the old page */

In place of these two fields, store three flags in a uint8 field.

/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
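
The suggestion to store three flags in a uint8 field could be sketched as follows. The flag names and helper functions are hypothetical and chosen only for this illustration; they are not taken from the reviewed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, one per former bool field. */
#define UPD_OLD_ALL_VISIBLE_CLEARED	0x01	/* PD_ALL_VISIBLE cleared on old page */
#define UPD_NEW_ALL_VISIBLE_CLEARED	0x02	/* same for the page of the new tuple */
#define UPD_DIFF_UPDATE				0x04	/* record carries a delta, not a full tuple */

/* Pack the three conditions into a single byte. */
uint8_t
update_flags_pack(bool old_cleared, bool new_cleared, bool diff_update)
{
	uint8_t		flags = 0;

	if (old_cleared)
		flags |= UPD_OLD_ALL_VISIBLE_CLEARED;
	if (new_cleared)
		flags |= UPD_NEW_ALL_VISIBLE_CLEARED;
	if (diff_update)
		flags |= UPD_DIFF_UPDATE;
	return flags;
}

/* Test a single flag bit. */
bool
update_flag_is_set(uint8_t flags, uint8_t flag)
{
	assert(flag != 0);
	return (flags & flag) != 0;
}
```

Compared with splitting one bool into two 4-bit halves, named bits keep the redo code readable and leave five bits free for later use.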

*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 528,533 **** struct MinimalTupleData
--- 528,546 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))

+ /* WAL Diff update options */
+ #define HEAP_UPDATE_WAL_OPT_COPY 0
+ #define HEAP_UPDATE_WAL_OPT_ADD  1
+ #define HEAP_UPDATE_WAL_OPT_IGN  2
+ #define HEAP_UPDATE_WAL_OPT_PAD  3

These defines can remain private to the file implementing the encoding.

+ 
+ /*
+  * Minimum tuple length required by the tuple during update operation for doing
+  * WAL optimization of update operation.
+  */
+ #define MinHeapTupleSizeForDiffUpdate 128

It's not at all clear to me what threshold to use, and 128 feels high. If you
want to select a threshold, I suggest benchmarking through a binary search of
small tuple sizes. That being said, though I have no doubt the algorithm will
lose when updating a single one-byte column, it will also finish darn quickly.
Might it be enough to just run the delta algorithm every time but discard any
diff wider than the complete new tuple?
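
A minimal sketch of that last idea: always compute the delta, then log whichever form is smaller. The function name and signature are hypothetical, and the caller is assumed to provide an output buffer large enough for either form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy either the delta or the full new tuple into 'out' and return the
 * number of bytes written. *is_delta tells the reader which form was chosen.
 */
size_t
choose_update_payload(const unsigned char *delta, size_t delta_len,
					  const unsigned char *full, size_t full_len,
					  unsigned char *out, bool *is_delta)
{
	assert(out != NULL && is_delta != NULL);

	/* Keep the delta only when it actually saves space. */
	*is_delta = (delta_len < full_len);
	if (*is_delta)
	{
		memcpy(out, delta, delta_len);
		return delta_len;
	}
	memcpy(out, full, full_len);
	return full_len;
}
```

This replaces a fixed size threshold with a per-record decision, at the price of running the encoder on tuples where it cannot win.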

Thanks,
nm

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Noah Misch (#4)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Noah Misch <noah@leadboat.com> writes:

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective. Expansion in one column may instigate TOAST
compression of a logically-unmodified column, and that counts as a change for
xlog delta purposes.

Um ... what about BEFORE triggers?

Frankly, I think that expecting the executor to tell you which columns
have been modified is a non-starter. We have a solution for HOT and
it's silly to do the same thing differently just a few lines away.

regards, tom lane

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Tom Lane (#5)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Thursday, September 27, 2012 10:19 AM Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective. Expansion in one column may instigate TOAST
compression of a logically-unmodified column, and that counts as a change for
xlog delta purposes.

Um ... what about BEFORE triggers?

This optimization is not applied when a BEFORE trigger modifies the tuple.

Frankly, I think that expecting the executor to tell you which columns
have been modified is a non-starter. We have a solution for HOT and
it's silly to do the same thing differently just a few lines away.

My concern is that comparing all attributes to find out which ones were
modified, and doing that under the buffer exclusive lock, could eat into the
performance gain.
In the case of HOT, only the indexed attributes are compared.

I agree that doing things differently in two nearby places is not good.
So I will do it the same way as for HOT and then take the performance data
again; if there is no big impact, we can do it that way.

With Regards,
Amit Kapila.

Heikki Linnakangas

hlinnakangas@vmware.com

over 13 years ago

In reply to: Amit Kapila (#3)

2 attachment(s)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 25.09.2012 18:27, Amit Kapila wrote:

If you feel the comparison is a must, we can do it in the same way as we
identify HOT updates?

Yeah. (But as discussed, I think it would be even better to just treat
the old and new tuple as an opaque chunk of bytes, and run them through
a generic delta algorithm).

Can you please explain why you think applying LZ compression after the
encoding is better, given that we have already reduced the amount of WAL
for updates by storing only the changed column information?

a. is it to further reduce the size of WAL
b. storing diff WAL in some standard format
c. or does it give any other kind of benefit

Potentially all of those. I don't know if it'd be better or worse, but
my gut feeling is that it would be simpler, and produce even more
compact WAL.

Attached is a simple patch to apply LZ compression to update WAL
records. I modified the LZ compressor so that it can optionally use a
separate "history" data, and the same history data must then be passed
to the decompression function. That makes it work as a pretty efficient
delta encoder, when you use the old tuple as the history data.
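
The decoding side of this idea can be sketched in isolation: a back-reference whose offset reaches past the start of the output continues into the tail of the history buffer, as if history and output were one concatenated string. The function below is a standalone illustration with a hypothetical name; it mirrors the inner copy loop of the attached patch but is not the patch code itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy 'len' bytes to dp, each taken from 'off' bytes back. When the
 * back-reference reaches before the start of dest, read from the end of
 * the history buffer instead. Returns the advanced output pointer.
 */
unsigned char *
copy_match_with_history(unsigned char *dest, unsigned char *dp,
						size_t off, size_t len,
						const unsigned char *history, size_t hlen)
{
	const unsigned char *hend = history + hlen;

	while (len--)
	{
		size_t		produced = (size_t) (dp - dest);

		if (off > produced)
		{
			/* Distance back from the end of the history buffer. */
			size_t		hoff = off - produced;

			assert(hoff <= hlen);
			*dp = hend[-(ptrdiff_t) hoff];
		}
		else
			*dp = dp[-(ptrdiff_t) off];
		dp++;
	}
	return dp;
}
```

A match may start in the history and run on into bytes that were just written to the output, which is why the check is made per byte rather than once per match.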

I ran some performance tests with the modified version of pgbench that
you posted earlier:

Current PostgreSQL master
-------------------------

tps = 941.601924 (excluding connections establishing)
pg_xlog_location_diff
-----------------------
721227944

pglz_wal_update_records.patch
-----------------------------

tps = 1039.792527 (excluding connections establishing)
pg_xlog_location_diff
-----------------------
419395208

pglz_wal_update_records.patch, COMPRESS_ONLY
--------------------------------------------

tps = 1009.682002 (excluding connections establishing)
pg_xlog_location_diff
-----------------------
422505104

Amit's wal_update_changes_hot_update.patch
------------------------------------------

tps = 1092.703883 (excluding connections establishing)
pg_xlog_location_diff
-----------------------
436031544

The COMPRESS_ONLY result is with the attached patch, but it just uses LZ
to compress the new tuple, without taking advantage of the old tuple.
The pg_xlog_location_diff value is the amount of WAL generated during
the pgbench run. Attached is also the shell script I used to run these
tests.

The conclusion is that there isn't very much difference among the
patches. They all squeeze the WAL to about the same size, and the
increase in TPS is roughly the same.
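
To put numbers on that, the figures above work out to roughly 40 to 42 percent less WAL and about 7 to 16 percent more TPS than master, depending on the patch. The small helper below (a hypothetical name, added only to show the arithmetic) computes the relative change from the quoted values.

```c
#include <assert.h>

/* Percentage change from a baseline; a negative result means a reduction. */
double
percent_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

For example, percent_change(721227944.0, 419395208.0) gives the WAL change for pglz_wal_update_records.patch, and percent_change(941.601924, 1092.703883) gives the TPS change for Amit's patch.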

I think more performance testing is required. The modified pgbench test
isn't necessarily very representative of a real-life application. The
gain (or loss) of this patch is going to depend a lot on how many
columns are updated, and in what ways. Need to test more scenarios, with
many different database schemas.

The LZ approach has the advantage that it can take advantage of all
kinds of similarities between old and new tuple. For example, if you
swap the values of two columns, LZ will encode that efficiently. Or if
you insert a character in the middle of a long string. On the flipside,
it's probably more expensive. Then again, you have to do a memcmp() to
detect which columns have changed with your approach, and that's not
free either. That was not yet included in the patch version I tested.
Another consideration is that when you compress the record more, you
have less data to calculate CRC for. CRC calculation tends to be quite
expensive, so even quite aggressive compression might be a win. Yet
another consideration is that the compression/encoding is done while
holding a lock on the buffer. For the sake of concurrency, you want to
keep the duration the lock is held as short as possible.

- Heikki

Attachments:

pglz_wal_update_records.patch (text/x-diff)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a4591e..56b53a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,6 +70,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
 
 
 /* GUC variable */
@@ -85,6 +86,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+				HeapTuple oldtup,
 				bool all_visible_cleared, bool new_all_visible_cleared);
 static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
 					   HeapTuple oldtup, HeapTuple newtup);
@@ -3195,10 +3197,12 @@ l2:
 	/* XLOG stuff */
 	if (RelationNeedsWAL(relation))
 	{
-		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
-											 newbuf, heaptup,
-											 all_visible_cleared,
-											 all_visible_cleared_new);
+		XLogRecPtr	recptr;
+
+		recptr = log_heap_update(relation, buffer, oldtup.t_self,
+								 newbuf, heaptup, &oldtup,
+								 all_visible_cleared,
+								 all_visible_cleared_new);
 
 		if (newbuf != buffer)
 		{
@@ -4428,7 +4432,7 @@ log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
  */
 static XLogRecPtr
 log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
-				Buffer newbuf, HeapTuple newtup,
+				Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
 				bool all_visible_cleared, bool new_all_visible_cleared)
 {
 	xl_heap_update xlrec;
@@ -4437,6 +4441,16 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
 	XLogRecPtr	recptr;
 	XLogRecData rdata[4];
 	Page		page = BufferGetPage(newbuf);
+	union
+	{
+		PGLZ_Header pglzheader;
+		char buf[BLCKSZ];
+	} buf;
+	char	   *newtupdata;
+	int			newtuplen;
+	char	   *oldtupdata;
+	int			oldtuplen;
+	bool		compressed = false;
 
 	/* Caller should not call me on a non-WAL-logged relation */
 	Assert(RelationNeedsWAL(reln));
@@ -4446,11 +4460,43 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
 	else
 		info = XLOG_HEAP_UPDATE;
 
+	newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+	newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+	oldtupdata = ((char *) oldtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+	oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+	if (oldbuf == newbuf && oldtup)
+	{
+		/*
+		 * enable this if you only want to compress the new tuple as is,
+		 * without taking advantage of the old tuple.
+		 */
+#ifdef COMPRESS_ONLY
+		oldtuplen = 0;
+#endif
+
+		/* Delta-encode the new tuple using the old tuple */
+		/* XXX: assert that the output buffer is large enough (PGLZ_MAX_OUTPUT) */
+		if (pglz_compress_with_history(newtupdata, newtuplen,
+									   oldtupdata, oldtuplen,
+									   (PGLZ_Header *) &buf.pglzheader, NULL))
+		{
+			compressed = true;
+			newtupdata = (char *) &buf.pglzheader;
+			newtuplen = VARSIZE(&buf.pglzheader);
+		}
+	}
+
+	xlrec.flags = 0;
 	xlrec.target.node = reln->rd_node;
 	xlrec.target.tid = from;
-	xlrec.all_visible_cleared = all_visible_cleared;
+	if (all_visible_cleared)
+		xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
 	xlrec.newtid = newtup->t_self;
-	xlrec.new_all_visible_cleared = new_all_visible_cleared;
+	if (new_all_visible_cleared)
+		xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+	if (compressed)
+		xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
 
 	rdata[0].data = (char *) &xlrec;
 	rdata[0].len = SizeOfHeapUpdate;
@@ -4478,12 +4524,13 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
 	rdata[2].next = &(rdata[3]);
 
 	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
-	rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
-	rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+	rdata[3].data = newtupdata;
+	rdata[3].len = newtuplen;
 	rdata[3].buffer = newbuf;
 	rdata[3].buffer_std = true;
 	rdata[3].next = NULL;
 
+
 	/* If new tuple is the single and first tuple on page... */
 	if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
 		PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
@@ -5232,6 +5279,8 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
 	OffsetNumber offnum;
 	ItemId		lp = NULL;
 	HeapTupleHeader htup;
+	HeapTupleHeader oldtup = NULL;
+	uint32		old_tup_len = 0;
 	struct
 	{
 		HeapTupleHeaderData hdr;
@@ -5246,7 +5295,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
 	 * The visibility map may need to be fixed even if the heap page is
 	 * already up-to-date.
 	 */
-	if (xlrec->all_visible_cleared)
+	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
 	{
 		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
 		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -5289,7 +5338,8 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
 	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
 		elog(PANIC, "heap_update_redo: invalid lp");
 
-	htup = (HeapTupleHeader) PageGetItem(page, lp);
+	oldtup = htup = (HeapTupleHeader) PageGetItem(page, lp);
+	old_tup_len = ItemIdGetLength(lp);
 
 	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
 						  HEAP_XMAX_INVALID |
@@ -5308,7 +5358,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
 	/* Mark the page as a candidate for pruning */
 	PageSetPrunable(page, record->xl_xid);
 
-	if (xlrec->all_visible_cleared)
+	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
 		PageClearAllVisible(page);
 
 	/*
@@ -5330,7 +5380,7 @@ newt:;
 	 * The visibility map may need to be fixed even if the heap page is
 	 * already up-to-date.
 	 */
-	if (xlrec->new_all_visible_cleared)
+	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
 	{
 		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
 		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -5380,16 +5430,40 @@ newsame:;
 	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
 
 	newlen = record->xl_len - hsize;
+
 	Assert(newlen <= MaxHeapTupleSize);
 	memcpy((char *) &xlhdr,
 		   (char *) xlrec + SizeOfHeapUpdate,
 		   SizeOfHeapHeader);
 	htup = &tbuf.hdr;
 	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
-	/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
-	memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
-		   (char *) xlrec + hsize,
-		   newlen);
+
+	/*
+	 * If the new tuple was delta-encoded, decode it.
+	 */
+	if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+	{
+		PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
+
+		/*
+		 * FIXME: this won't work on architectures with strict alignment,
+		 * because encoded_data might not be aligned and pglz_decompress
+		 * assumes that the PGLZ_Header is correctly aligned. XXX: also add
+		 * some sanity checks with PGLZ_RAW_SIZE here.
+		 */
+		pglz_decompress_with_history(encoded_data,
+									 ((char *) htup) + offsetof(HeapTupleHeaderData, t_bits),
+									 ((char *) oldtup) + offsetof(HeapTupleHeaderData, t_bits),
+									 old_tup_len - offsetof(HeapTupleHeaderData, t_bits));
+		newlen = encoded_data->rawsize;
+	}
+	else
+	{
+		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+			   (char *) xlrec + hsize,
+			   newlen);
+	}
 	newlen += offsetof(HeapTupleHeaderData, t_bits);
 	htup->t_infomask2 = xlhdr.t_infomask2;
 	htup->t_infomask = xlhdr.t_infomask;
@@ -5404,7 +5478,7 @@ newsame:;
 	if (offnum == InvalidOffsetNumber)
 		elog(PANIC, "heap_update_redo: failed to add tuple");
 
-	if (xlrec->new_all_visible_cleared)
+	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
 		PageClearAllVisible(page);
 
 	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 466982e..eb24355 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -482,6 +482,20 @@ bool
 pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
 			  const PGLZ_Strategy *strategy)
 {
+	return pglz_compress_with_history(source, slen, NULL, 0, dest, strategy);
+}
+
+/*
+ * Like pglz_compress, but uses another piece of data to initialize the
+ * history table. When decompressing, you must pass the same history data
+ * to pglz_decompress_with_history(). This makes it possible to do simple
+ * delta compression.
+ */
+bool
+pglz_compress_with_history(const char *source, int32 slen,
+						   const char *history, int32 hlen,
+						   PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+{
 	unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
 	unsigned char *bstart = bp;
 	int			hist_next = 0;
@@ -560,6 +574,24 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
 	 * hist_entries[] array; its entries are initialized as they are used.
 	 */
 	memset(hist_start, 0, sizeof(hist_start));
+	if (hlen > 0)
+	{
+		const char *hp = history;
+		const char *hend = history + hlen;
+		while (hp < hend)
+		{
+			/*
+			 * XXX: I think this doesn't handle the last few bytes of the
+			 * history correctly, or at least not in the most efficient way.
+			 * Logically, we should behave like the history and the source
+			 * strings are concatenated, but we use 'hend' here.
+			 */
+			pglz_hist_add(hist_start, hist_entries,
+						  hist_next, hist_recycle,
+						  hp, hend);
+			hp++;			/* Do not do this ++ in the line above! */
+		}
+	}
 
 	/*
 	 * Compress the source directly into the output buffer.
@@ -647,10 +679,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
 void
 pglz_decompress(const PGLZ_Header *source, char *dest)
 {
+	pglz_decompress_with_history(source, dest, NULL, 0);
+}
+
+void
+pglz_decompress_with_history(const PGLZ_Header *source, char *dest,
+							 const char *history, int32 hlen)
+{
 	const unsigned char *sp;
 	const unsigned char *srcend;
 	unsigned char *dp;
 	unsigned char *destend;
+	unsigned char *hend = NULL;
+
+	if (hlen > 0)
+		hend = (unsigned char *) history + hlen;
 
 	sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
 	srcend = ((const unsigned char *) source) + VARSIZE(source);
@@ -707,7 +750,20 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
 				 */
 				while (len--)
 				{
-					*dp = dp[-off];
+					if (off > (dp - (unsigned char *) dest))
+					{
+						/*
+						 * this offset refers to the history passed by
+						 * the caller in a separate buffer.
+						 */
+						int hoff = off - (dp - (unsigned char *) dest);
+						Assert(hoff < hlen);
+						*dp = hend[-hoff];
+					}
+					else
+					{
+						*dp = dp[-off];
+					}
 					dp++;
 				}
 			}
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 8ec710e..5dd2809 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -142,12 +142,16 @@ typedef struct xl_heap_update
 {
 	xl_heaptid	target;			/* deleted tuple id */
 	ItemPointerData newtid;		/* new inserted tuple id */
-	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
-	bool		new_all_visible_cleared;		/* same for the page of newtid */
+	char		flags;
+
 	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
 } xl_heap_update;
 
-#define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED		0x01
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED	0x02
+#define XL_HEAP_UPDATE_DELTA_ENCODED			0x04
+
+#define SizeOfHeapUpdate	(offsetof(xl_heap_update, flags) + sizeof(char))
 
 /*
  * This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..cddc476 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
  */
 extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
 			  const PGLZ_Strategy *strategy);
+extern bool pglz_compress_with_history(const char *source, int32 slen,
+						   const char *history, int32 hlen,
+						   PGLZ_Header *dest,
+						   const PGLZ_Strategy *strategy);
 extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_decompress_with_history(const PGLZ_Header *source, char *dest,
+										 const char *history, int32 hlen);
 
 #endif   /* _PG_LZCOMPRESS_H_ */

pgbench-xlogtest.sh (application/x-sh)

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Heikki Linnakangas (#7)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
On 25.09.2012 18:27, Amit Kapila wrote:

If you feel the comparison is a must, we can do it in the same way
as we identify a HOT update?

Yeah. (But as discussed, I think it would be even better to just treat
the old and new tuple as an opaque chunk of bytes, and run them through
a generic delta algorithm).

Thank you for the modified patch.

The conclusion is that there isn't very much difference among the
patches. They all squeeze the WAL to about the same size, and the
increase in TPS is roughly the same.

I think more performance testing is required. The modified pgbench test
isn't necessarily very representative of a real-life application. The
gain (or loss) of this patch is going to depend a lot on how many
columns are updated, and in what ways. Need to test more scenarios,
with many different database schemas.

The LZ approach has the advantage that it can take advantage of all
kinds of similarities between old and new tuple. For example, if you
swap the values of two columns, LZ will encode that efficiently. Or if
you insert a character in the middle of a long string. On the flipside,
it's probably more expensive. Then again, you have to do a memcmp() to
detect which columns have changed with your approach, and that's not
free either. That was not yet included in the patch version I tested.
Another consideration is that when you compress the record more, you
have less data to calculate CRC for. CRC calculation tends to be quite
expensive, so even quite aggressive compression might be a win. Yet
another consideration is that the compression/encoding is done while
holding a lock on the buffer. For the sake of concurrency, you want to
keep the duration the lock is held as short as possible.
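
To make the "history" idea concrete, here is a minimal sketch in C of how one LZ back-reference can be resolved when its offset reaches past the start of the output and into a separate history buffer (the old tuple). It is modelled on the inner loop the patch adds to pglz_decompress_with_history(), but copy_match() and its parameters are hypothetical names used only for illustration, and the bound check (hoff <= hlen) is an assumption of this sketch rather than the patch's exact assertion.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy `len` bytes for a single back-reference of distance `off`.
 * Hypothetical helper for illustration; not part of PostgreSQL.
 *
 * out     - start of the output buffer
 * pos     - number of bytes already written to out
 * history - bytes of the old tuple (may be NULL when hlen == 0)
 * Returns the new write position.
 */
static size_t
copy_match(unsigned char *out, size_t pos, size_t off, size_t len,
		   const unsigned char *history, size_t hlen)
{
	const unsigned char *hend = history ? history + hlen : NULL;

	while (len--)
	{
		if (off > pos)
		{
			/* The reference points before out[0], i.e. into the history. */
			size_t		hoff = off - pos;

			assert(hend != NULL && hoff <= hlen);
			out[pos] = *(hend - hoff);
		}
		else
		{
			/* Ordinary LZ case: copy from already-decoded output. */
			out[pos] = out[pos - off];
		}
		pos++;
	}
	return pos;
}
```

With history "abcdef" and an empty output, a reference of distance 4 and length 6 first pulls "cdef" from the end of the history and then continues from the bytes it has just written, which is why swapped or shifted column values can be encoded cheaply.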

Now I shall run tests for the following and post the results here:
a. The attached patch, in the mode where it takes advantage of the
history tuple
b. The logic for calculating modified columns changed to use memcmp()
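
As an illustration of option b, the following is a minimal sketch in C of detecting modified columns by comparing bytes rather than trusting the executor's list. The ColSpan type and find_changed_columns() are hypothetical; the sketch assumes the column offsets and lengths within each tuple's data area have already been computed, which the real heap code would have to do by walking the tuple descriptor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: where one column's bytes live inside a tuple. */
typedef struct
{
	size_t		off;			/* byte offset within the tuple data */
	size_t		len;			/* length of the column value in bytes */
} ColSpan;

/*
 * Set changed[i] to true when column i differs between old and new tuple.
 * Returns the number of changed columns.
 */
static int
find_changed_columns(const unsigned char *oldp, const ColSpan *oldcols,
					 const unsigned char *newp, const ColSpan *newcols,
					 int ncols, bool *changed)
{
	int			i;
	int			nchanged = 0;

	for (i = 0; i < ncols; i++)
	{
		/* A length change is a change; otherwise compare the bytes. */
		changed[i] = oldcols[i].len != newcols[i].len ||
			memcmp(oldp + oldcols[i].off, newp + newcols[i].off,
				   oldcols[i].len) != 0;
		if (changed[i])
			nchanged++;
	}
	return nchanged;
}
```

The cost is one memcmp() per column while the buffer lock is held, which is the overhead mentioned above.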

With Regards,
Amit Kapila.

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Noah Misch (#4)

On Thursday, September 27, 2012 9:12 AM Noah Misch wrote:
On Mon, Sep 24, 2012 at 10:57:02AM +0000, Amit kapila wrote:

Rebased version of patch based on latest code.

I like the direction you're taking with this patch; the gains are
striking, especially considering the isolation of the changes.

Thank you for a detailed review of the patch.

You cannot assume executor-unmodified columns are also unmodified from
heap_update()'s perspective. Expansion in one column may instigate TOAST
compression of a logically-unmodified column, and that counts as a
change for xlog delta purposes. You do currently skip the optimization
for relations having a TOAST table, but TOAST compression can still
apply. Observe this with text columns of storage mode PLAIN. I see two
ways out: skip the new behavior when need_toast=true, or compare all
inline column data, not just what the executor modified. One can
probably construct a benchmark favoring either choice. I'd lean toward
the latter; wide tuples are the kind this change can most help. If the
marginal advantage of ignoring known-unmodified columns proves
important, we can always bring it back after designing a way to track
which columns changed in the toaster.

You are right that either approach can be beneficial, but we should also
see which approach gives better results for most scenarios.
In most of the UPDATE cases I have observed, the change does not
increase the length of the value by much.
OTOH, I am not sure; there may be many more scenarios that change the
length of the updated value, which can lead to the situation you
explained above.

Given that, why not treat the tuple as an opaque series of bytes and not
worry about datum boundaries? When several narrow columns change
together, say a sequence of sixteen smallint columns, you will use fewer
binary delta commands by representing the change with a single 32-byte
substitution. If an UPDATE changes just part of a long datum, the delta
encoding algorithm will still be able to save considerable space. That
case arises in many forms: changing one word in a long string, changing
one element in a long array, changing one field of a composite-typed
column. Granted, this makes the choice of delta encoding algorithm more
important.
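
A minimal sketch in C of the opaque-bytes idea follows. It is deliberately the simplest possible delta, a common prefix plus a common suffix with one substituted middle run, and is neither the patch's encoding nor VCDIFF; ByteDelta, delta_encode() and delta_apply() are hypothetical names. It shows how a run of adjacent changed columns collapses into a single substitution without any knowledge of datum boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical delta:
 *   new = old[0 .. prefix) + middle + old[oldlen - suffix .. oldlen)
 */
typedef struct
{
	size_t		prefix;			/* leading bytes shared with the old tuple */
	size_t		suffix;			/* trailing bytes shared with the old tuple */
	const unsigned char *middle;	/* replaced bytes, pointing into new */
	size_t		midlen;
} ByteDelta;

static ByteDelta
delta_encode(const unsigned char *oldp, size_t oldlen,
			 const unsigned char *newp, size_t newlen)
{
	ByteDelta	d;
	size_t		max = oldlen < newlen ? oldlen : newlen;
	size_t		p = 0;
	size_t		s = 0;

	/* Longest common prefix. */
	while (p < max && oldp[p] == newp[p])
		p++;
	/* Longest common suffix that does not overlap the prefix. */
	while (s < max - p && oldp[oldlen - 1 - s] == newp[newlen - 1 - s])
		s++;

	d.prefix = p;
	d.suffix = s;
	d.middle = newp + p;
	d.midlen = newlen - p - s;
	return d;
}

/* Rebuild the new tuple from the old one plus the delta; returns its length. */
static size_t
delta_apply(const ByteDelta *d, const unsigned char *oldp, size_t oldlen,
			unsigned char *out)
{
	assert(d->prefix + d->suffix <= oldlen);
	memcpy(out, oldp, d->prefix);
	memcpy(out + d->prefix, d->middle, d->midlen);
	memcpy(out + d->prefix + d->midlen, oldp + oldlen - d->suffix, d->suffix);
	return d->prefix + d->midlen + d->suffix;
}
```

Only the prefix length, the suffix length and the middle bytes would need to go into the WAL record. A scheme like this cannot handle several separated changes or swapped columns, which is where an LZ-style or VCDIFF-style encoding earns its extra cost.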

Like Heikki, I'm left wondering why your custom delta encoding is
preferable to an encoding from the literature. Your encoding has much in
common with VCDIFF, even sharing two exact command names. If a custom
encoding is the right thing, code comments or a README section should at
least discuss the advantages over an established alternative. Idle
thought: it might pay off to use 1-byte sizes and offsets most of the
time. Tuples shorter than 256 bytes are common; for longer tuples, we
can afford wider offsets.

My apprehension was that it could affect performance if we do more work
while holding the lock.
If we use a standard technique like LZ or VCDIFF, it has the overhead of
comparison and other things pertaining to the algorithm.
However, using the updated patch by Heikki, I can run the various
performance tests for both the update operation and recovery.

With Regards,
Amit Kapila.

#10

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Amit Kapila (#8)

4 attachment(s)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:

On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
On 25.09.2012 18:27, Amit Kapila wrote:

If you feel the comparison is a must, we can do it in the same way
as we identify a HOT update?

Yeah. (But as discussed, I think it would be even better to just treat
the old and new tuple as an opaque chunk of bytes, and run them through
a generic delta algorithm).

Thank you for the modified patch.

The conclusion is that there isn't very much difference among the
patches. They all squeeze the WAL to about the same size, and the
increase in TPS is roughly the same.

I think more performance testing is required. The modified pgbench test
isn't necessarily very representative of a real-life application. The
gain (or loss) of this patch is going to depend a lot on how many
columns are updated, and in what ways. Need to test more scenarios,
with many different database schemas.

I have done a few and am planning to do more.

Now I shall run tests for the following and post the results here:
a. The attached patch, in the mode where it takes advantage of the
history tuple
b. The logic for calculating modified columns changed to use memcmp()

The attached documents contain data for the following scenarios for both
patches, 'a' (LZ compression patch) and 'b' (modified WAL patch):

1. Using fixed string (last few bytes are random) to update the column
values.
Total record length = 1800
Updated columns length = 250
2. Using random string to update the column values
Total record length = 1800
Updated columns length = 250

Observations -
1. With both patches the performance increase is very good.
2. The performance increase is almost the same with both patches,
slightly higher for the LZ compression patch.
3. TPS varies with the LZ patch, but on average it is equivalent to the
other patch.

Other performance tests I am planning to conduct:
1. Using a bigger random string to update the column values
Total record length = 1800
Updated columns length = 250
2. Using a fixed string (last few bytes are random) to update the column
values.
Total record length = 1800
Updated columns length = 50, 100, 500, 750, 1000, 1500, 1800
3. Recovery performance test as suggested by Noah
4. Complete testing of the LZ compression patch using the test cases
defined for the original patch

Kindly suggest more performance test cases that can make the findings
concrete, or in case you feel the above is sufficient, please confirm.

With Regards,
Amit Kapila.

Attachments:

pgbench_wal_modified_and_lz_fixed_test.htm (text/html)

pgbench_fixed.c (application/octet-stream)

/*
 * pgbench.c
 *
 * A simple benchmark program for PostgreSQL
 * Originally written by Tatsuo Ishii and enhanced by many contributors.
 *
 * contrib/pgbench/pgbench.c
 * Copyright (c) 2000-2012, PostgreSQL Global Development Group
 * ALL RIGHTS RESERVED;
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose, without fee, and without a written agreement
 * is hereby granted, provided that the above copyright notice and this
 * paragraph and the following two paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
 * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIMS ANY WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
 * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
 * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 *
 */

#ifdef WIN32
#define FD_SETSIZE 1024			/* set before winsock2.h is included */
#endif   /* ! WIN32 */

#include "postgres_fe.h"

#include "getopt_long.h"
#include "libpq-fe.h"
#include "libpq/pqsignal.h"
#include "portability/instr_time.h"

#include <ctype.h>

#ifndef WIN32
#include <sys/time.h>
#include <unistd.h>
#endif   /* ! WIN32 */

#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

#ifdef HAVE_SYS_RESOURCE_H
#include <sys/resource.h>		/* for getrlimit */
#endif

#ifndef INT64_MAX
#define INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
#endif

/*
 * Multi-platform pthread implementations
 */

#ifdef WIN32
/* Use native win32 threads on Windows */
typedef struct win32_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#elif defined(ENABLE_THREAD_SAFETY)
/* Use platform-dependent pthread capability */
#include <pthread.h>
#else
/* Use emulation with fork. Rename pthread identifiers to avoid conflicts */

#include <sys/wait.h>

#define pthread_t				pg_pthread_t
#define pthread_attr_t			pg_pthread_attr_t
#define pthread_create			pg_pthread_create
#define pthread_join			pg_pthread_join

typedef struct fork_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#endif

extern char *optarg;
extern int	optind;


/********************************************************************
 * some configurable parameters */

/* max number of clients allowed */
#ifdef FD_SETSIZE
#define MAXCLIENTS	(FD_SETSIZE - 10)
#else
#define MAXCLIENTS	1024
#endif

#define DEFAULT_NXACTS	10		/* default nxacts */

int			nxacts = 0;			/* number of transactions per client */
int			duration = 0;		/* duration in seconds */

/*
 * scaling factor. for example, scale = 10 will make 1000000 tuples in
 * pgbench_accounts table.
 */
int			scale = 1;

/*
 * fillfactor. for example, fillfactor = 90 will use only 90 percent
 * space during inserts and leave 10 percent free.
 */
int			fillfactor = 100;

/*
 * create foreign key constraints on the tables?
 */
int			foreign_keys = 0;

/*
 * use unlogged tables?
 */
int			unlogged_tables = 0;

/*
 * tablespace selection
 */
char	   *tablespace = NULL;
char	   *index_tablespace = NULL;

/*
 * end of configurable parameters
 *********************************************************************/

#define nbranches	1			/* Makes little sense to change this.  Change
								 * -s instead */
#define ntellers	10
#define naccounts	100000

bool		use_log;			/* log transaction latencies to a file */
bool		is_connect;			/* establish connection for each transaction */
bool		is_latencies;		/* report per-command latencies */
int			main_pid;			/* main process id used in log filename */

char	   *pghost = "";
char	   *pgport = "";
char	   *login = NULL;
char	   *dbName;
const char *progname;

volatile bool timer_exceeded = false;	/* flag from signal handler */

/* variable definitions */
typedef struct
{
	char	   *name;			/* variable name */
	char	   *value;			/* its value */
} Variable;

#define MAX_FILES		128		/* max number of SQL script files allowed */
#define SHELL_COMMAND_SIZE	256 /* maximum size allowed for shell command */

/*
 * structures used in custom query mode
 */

typedef struct
{
	PGconn	   *con;			/* connection handle to DB */
	int			id;				/* client No. */
	int			state;			/* state No. */
	int			cnt;			/* xacts count */
	int			ecnt;			/* error count */
	int			listen;			/* 0 indicates that an async query has been
								 * sent */
	int			sleeping;		/* 1 indicates that the client is napping */
	int64		until;			/* napping until (usec) */
	Variable   *variables;		/* array of variable definitions */
	int			nvariables;
	instr_time	txn_begin;		/* used for measuring transaction latencies */
	instr_time	stmt_begin;		/* used for measuring statement latencies */
	int			use_file;		/* index in sql_files for this client */
	bool		prepared[MAX_FILES];
} CState;

/*
 * Thread state and result
 */
typedef struct
{
	int			tid;			/* thread id */
	pthread_t	thread;			/* thread handle */
	CState	   *state;			/* array of CState */
	int			nstate;			/* length of state[] */
	instr_time	start_time;		/* thread start time */
	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
	int		   *exec_count;		/* number of cmd executions (per Command) */
	unsigned short random_state[3];		/* separate randomness for each thread */
} TState;

#define INVALID_THREAD		((pthread_t) 0)

typedef struct
{
	instr_time	conn_time;
	int			xacts;
} TResult;

/*
 * queries read from files
 */
#define SQL_COMMAND		1
#define META_COMMAND	2
#define MAX_ARGS		10

typedef enum QueryMode
{
	QUERY_SIMPLE,				/* simple query */
	QUERY_EXTENDED,				/* extended query */
	QUERY_PREPARED,				/* extended query with prepared statements */
	NUM_QUERYMODE
} QueryMode;

static QueryMode querymode = QUERY_SIMPLE;
static const char *QUERYMODE[] = {"simple", "extended", "prepared"};

typedef struct
{
	char	   *line;			/* full text of command line */
	int			command_num;	/* unique index of this Command struct */
	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
	int			argc;			/* number of command words */
	char	   *argv[MAX_ARGS]; /* command word list */
} Command;

static Command **sql_files[MAX_FILES];	/* SQL script files */
static int	num_files;			/* number of script files */
static int	num_commands = 0;	/* total number of Command structs */
static int	debug = 0;			/* debug flag */

/* default scenario */
static char *tpc_b = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta,"
	"filler = \'abcdefghijkABCDEFGHIJK :delta \',"
	" filler1 = \'lmnopqrstuvwxyz :delta\' WHERE aid = :aid;\n"
	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta,"
	"filler = \'abcdefghijkABCDEFGHIJK :delta \',"
	" filler1 = \'lmnopqrstuvwxyz :delta\' WHERE tid = :tid;\n"
	"UPDATE pgbench_branches SET bbalance = bbalance + :delta,"
	"filler = \'abcdefghijkABCDEFGHIJK :delta \',"
	" filler1 = \'lmnopqrstuvwxyz :delta\' WHERE bid = :bid;\n"
	"END;\n"
};

/* -N case */
static char *simple_update = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
	"END;\n"
};

/* -S case */
static char *select_only = {
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
};

/* Function prototypes */
static void setalarm(int seconds);
static void *threadRun(void *arg);


/*
 * routines to check mem allocations and fail noisily.
 */
static void *
xmalloc(size_t size)
{
	void	   *result;

	result = malloc(size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static void *
xrealloc(void *ptr, size_t size)
{
	void	   *result;

	result = realloc(ptr, size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static char *
xstrdup(const char *s)
{
	char	   *result;

	result = strdup(s);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}


static void
usage(void)
{
	printf("%s is a benchmarking tool for PostgreSQL.\n\n"
		   "Usage:\n"
		   "  %s [OPTION]... [DBNAME]\n"
		   "\nInitialization options:\n"
		   "  -i           invokes initialization mode\n"
		   "  -n           do not run VACUUM after initialization\n"
		   "  -F NUM       fill factor\n"
		   "  -s NUM       scaling factor\n"
		   "  --foreign-keys\n"
		   "               create foreign key constraints between tables\n"
		   "  --index-tablespace=TABLESPACE\n"
		   "               create indexes in the specified tablespace\n"
		   "  --tablespace=TABLESPACE\n"
		   "               create tables in the specified tablespace\n"
		   "  --unlogged-tables\n"
		   "               create tables as unlogged tables\n"
		   "\nBenchmarking options:\n"
		"  -c NUM       number of concurrent database clients (default: 1)\n"
		   "  -C           establish new connection for each transaction\n"
		   "  -D VARNAME=VALUE\n"
		   "               define variable for use by custom script\n"
		   "  -f FILENAME  read transaction script from FILENAME\n"
		   "  -j NUM       number of threads (default: 1)\n"
		   "  -l           write transaction times to log file\n"
		   "  -M simple|extended|prepared\n"
		   "               protocol for submitting queries to server (default: simple)\n"
		   "  -n           do not run VACUUM before tests\n"
		   "  -N           do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
		   "  -r           report average latency per command\n"
		   "  -s NUM       report this scale factor in output\n"
		   "  -S           perform SELECT-only transactions\n"
	 "  -t NUM       number of transactions each client runs (default: 10)\n"
		   "  -T NUM       duration of benchmark test in seconds\n"
		   "  -v           vacuum all four standard tables before tests\n"
		   "\nCommon options:\n"
		   "  -d             print debugging output\n"
		   "  -h HOSTNAME    database server host or socket directory\n"
		   "  -p PORT        database server port number\n"
		   "  -U USERNAME    connect as specified database user\n"
		   "  -V, --version  output version information, then exit\n"
		   "  -?, --help     show this help, then exit\n"
		   "\n"
		   "Report bugs to <pgsql-bugs@postgresql.org>.\n",
		   progname, progname);
}

/* random number generator: uniform distribution from min to max inclusive */
static int
getrand(TState *thread, int min, int max)
{
	/*
	 * Odd coding is so that min and max have approximately the same chance of
	 * being selected as do numbers between them.
	 *
	 * pg_erand48() is thread-safe and concurrent, which is why we use it
	 * rather than random(), which in glibc is non-reentrant, and therefore
	 * protected by a mutex, and therefore a bottleneck on machines with many
	 * CPUs.
	 */
	return min + (int) ((max - min + 1) * pg_erand48(thread->random_state));
}

/* call PQexec() and exit() on failure */
static void
executeStatement(PGconn *con, const char *sql)
{
	PGresult   *res;

	res = PQexec(con, sql);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);
}

/* set up a connection to the backend */
static PGconn *
doConnect(void)
{
	PGconn	   *conn;
	static char *password = NULL;
	bool		new_pass;

	/*
	 * Start the connection.  Loop until we have a password if requested by
	 * backend.
	 */
	do
	{
#define PARAMS_ARRAY_SIZE	7

		const char *keywords[PARAMS_ARRAY_SIZE];
		const char *values[PARAMS_ARRAY_SIZE];

		keywords[0] = "host";
		values[0] = pghost;
		keywords[1] = "port";
		values[1] = pgport;
		keywords[2] = "user";
		values[2] = login;
		keywords[3] = "password";
		values[3] = password;
		keywords[4] = "dbname";
		values[4] = dbName;
		keywords[5] = "fallback_application_name";
		values[5] = progname;
		keywords[6] = NULL;
		values[6] = NULL;

		new_pass = false;

		conn = PQconnectdbParams(keywords, values, true);

		if (!conn)
		{
			fprintf(stderr, "Connection to database \"%s\" failed\n",
					dbName);
			return NULL;
		}

		if (PQstatus(conn) == CONNECTION_BAD &&
			PQconnectionNeedsPassword(conn) &&
			password == NULL)
		{
			PQfinish(conn);
			password = simple_prompt("Password: ", 100, false);
			new_pass = true;
		}
	} while (new_pass);

	/* check to see that the backend connection was successfully made */
	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database \"%s\" failed:\n%s",
				dbName, PQerrorMessage(conn));
		PQfinish(conn);
		return NULL;
	}

	return conn;
}

/* throw away response from backend */
static void
discard_response(CState *state)
{
	PGresult   *res;

	do
	{
		res = PQgetResult(state->con);
		if (res)
			PQclear(res);
	} while (res);
}

static int
compareVariables(const void *v1, const void *v2)
{
	return strcmp(((const Variable *) v1)->name,
				  ((const Variable *) v2)->name);
}

static char *
getVariable(CState *st, char *name)
{
	Variable	key,
			   *var;

	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables <= 0)
		return NULL;

	key.name = name;
	var = (Variable *) bsearch((void *) &key,
							   (void *) st->variables,
							   st->nvariables,
							   sizeof(Variable),
							   compareVariables);
	if (var != NULL)
		return var->value;
	else
		return NULL;
}

/* check whether the name consists of alphabets, numerals and underscores. */
static bool
isLegalVariableName(const char *name)
{
	int			i;

	for (i = 0; name[i] != '\0'; i++)
	{
		if (!isalnum((unsigned char) name[i]) && name[i] != '_')
			return false;
	}

	return true;
}

static int
putVariable(CState *st, const char *context, char *name, char *value)
{
	Variable	key,
			   *var;

	key.name = name;
	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables > 0)
		var = (Variable *) bsearch((void *) &key,
								   (void *) st->variables,
								   st->nvariables,
								   sizeof(Variable),
								   compareVariables);
	else
		var = NULL;

	if (var == NULL)
	{
		Variable   *newvars;

		/*
		 * Check for the name only when declaring a new variable to avoid
		 * overhead.
		 */
		if (!isLegalVariableName(name))
		{
			fprintf(stderr, "%s: invalid variable name '%s'\n", context, name);
			return false;
		}

		if (st->variables)
			newvars = (Variable *) xrealloc(st->variables,
									(st->nvariables + 1) * sizeof(Variable));
		else
			newvars = (Variable *) xmalloc(sizeof(Variable));

		st->variables = newvars;

		var = &newvars[st->nvariables];

		var->name = xstrdup(name);
		var->value = xstrdup(value);

		st->nvariables++;

		qsort((void *) st->variables, st->nvariables, sizeof(Variable),
			  compareVariables);
	}
	else
	{
		char	   *val;

		/* dup then free, in case value is pointing at this variable */
		val = xstrdup(value);

		free(var->value);
		var->value = val;
	}

	return true;
}

static char *
parseVariable(const char *sql, int *eaten)
{
	int			i = 0;
	char	   *name;

	do
	{
		i++;
	} while (isalnum((unsigned char) sql[i]) || sql[i] == '_');
	if (i == 1)
		return NULL;

	name = xmalloc(i);
	memcpy(name, &sql[1], i - 1);
	name[i - 1] = '\0';

	*eaten = i;
	return name;
}

static char *
replaceVariable(char **sql, char *param, int len, char *value)
{
	int			valueln = strlen(value);

	if (valueln > len)
	{
		size_t		offset = param - *sql;

		*sql = xrealloc(*sql, strlen(*sql) - len + valueln + 1);
		param = *sql + offset;
	}

	if (valueln != len)
		memmove(param + valueln, param + len, strlen(param + len) + 1);
	strncpy(param, value, valueln);

	return param + valueln;
}

static char *
assignVariables(CState *st, char *sql)
{
	char	   *p,
			   *name,
			   *val;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		val = getVariable(st, name);
		free(name);
		if (val == NULL)
		{
			p++;
			continue;
		}

		p = replaceVariable(&sql, p, eaten, val);
	}

	return sql;
}

static void
getQueryParams(CState *st, const Command *command, const char **params)
{
	int			i;

	for (i = 0; i < command->argc - 1; i++)
		params[i] = getVariable(st, command->argv[i + 1]);
}

/*
 * Run a shell command. The result is assigned to the variable if not NULL.
 * Return true if succeeded, or false on error.
 */
static bool
runShellCommand(CState *st, char *variable, char **argv, int argc)
{
	char		command[SHELL_COMMAND_SIZE];
	int			i,
				len = 0;
	FILE	   *fp;
	char		res[64];
	char	   *endptr;
	int			retval;

	/*----------
	 * Join arguments with whitespace separators. Arguments starting with
	 * exactly one colon are treated as variables:
	 *	name - append a string "name"
	 *	:var - append a variable named 'var'
	 *	::name - append a string ":name"
	 *----------
	 */
	for (i = 0; i < argc; i++)
	{
		char	   *arg;
		int			arglen;

		if (argv[i][0] != ':')
		{
			arg = argv[i];		/* a string literal */
		}
		else if (argv[i][1] == ':')
		{
			arg = argv[i] + 1;	/* a string literal starting with colons */
		}
		else if ((arg = getVariable(st, argv[i] + 1)) == NULL)
		{
			fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[i]);
			return false;
		}

		arglen = strlen(arg);
		if (len + arglen + (i > 0 ? 1 : 0) >= SHELL_COMMAND_SIZE - 1)
		{
			fprintf(stderr, "%s: too long shell command\n", argv[0]);
			return false;
		}

		if (i > 0)
			command[len++] = ' ';
		memcpy(command + len, arg, arglen);
		len += arglen;
	}

	command[len] = '\0';

	/* Fast path for non-assignment case */
	if (variable == NULL)
	{
		if (system(command))
		{
			if (!timer_exceeded)
				fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
			return false;
		}
		return true;
	}

	/* Execute the command with pipe and read the standard output. */
	if ((fp = popen(command, "r")) == NULL)
	{
		fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
		return false;
	}
	if (fgets(res, sizeof(res), fp) == NULL)
	{
		if (!timer_exceeded)
			fprintf(stderr, "%s: cannot read the result\n", argv[0]);
		(void) pclose(fp);		/* don't leak the pipe on failure */
		return false;
	}
	if (pclose(fp) < 0)
	{
		fprintf(stderr, "%s: cannot close shell command\n", argv[0]);
		return false;
	}

	/* Check whether the result is an integer and assign it to the variable */
	retval = (int) strtol(res, &endptr, 10);
	while (*endptr != '\0' && isspace((unsigned char) *endptr))
		endptr++;
	if (*res == '\0' || *endptr != '\0')
	{
		fprintf(stderr, "%s: must return an integer ('%s' returned)\n", argv[0], res);
		return false;
	}
	snprintf(res, sizeof(res), "%d", retval);
	if (!putVariable(st, "setshell", variable, res))
		return false;

#ifdef DEBUG
	printf("shell parameter name: %s, value: %s\n", argv[1], res);
#endif
	return true;
}

#define MAX_PREPARE_NAME		32
static void
preparedStatementName(char *buffer, int file, int state)
{
	sprintf(buffer, "P%d_%d", file, state);
}

static bool
clientDone(CState *st, bool ok)
{
	(void) ok;					/* unused */

	if (st->con != NULL)
	{
		PQfinish(st->con);
		st->con = NULL;
	}
	return false;				/* always false */
}

/* return false iff client should be disconnected */
static bool
doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile)
{
	PGresult   *res;
	Command   **commands;

top:
	commands = sql_files[st->use_file];

	if (st->sleeping)
	{							/* are we sleeping? */
		instr_time	now;

		INSTR_TIME_SET_CURRENT(now);
		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
		else
			return true;		/* Still sleeping, nothing to do here */
	}

	if (st->listen)
	{							/* are we receiver? */
		if (commands[st->state]->type == SQL_COMMAND)
		{
			if (debug)
				fprintf(stderr, "client %d receiving\n", st->id);
			if (!PQconsumeInput(st->con))
			{					/* there's something wrong */
				fprintf(stderr, "Client %d aborted in state %d. Probably the backend died while processing.\n", st->id, st->state);
				return clientDone(st, false);
			}
			if (PQisBusy(st->con))
				return true;	/* don't have the whole result yet */
		}

		/*
		 * command finished: accumulate per-command execution times in
		 * thread-local data structure, if per-command latencies are requested
		 */
		if (is_latencies)
		{
			instr_time	now;
			int			cnum = commands[st->state]->command_num;

			INSTR_TIME_SET_CURRENT(now);
			INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
								  now, st->stmt_begin);
			thread->exec_count[cnum]++;
		}

		/*
		 * if transaction finished, record the time it took in the log
		 */
		if (logfile && commands[st->state + 1] == NULL)
		{
			instr_time	now;
			instr_time	diff;
			double		usec;

			INSTR_TIME_SET_CURRENT(now);
			diff = now;
			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
			usec = (double) INSTR_TIME_GET_MICROSEC(diff);

#ifndef WIN32
			/* This is more than we really ought to know about instr_time */
			fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
					st->id, st->cnt, usec, st->use_file,
					(long) now.tv_sec, (long) now.tv_usec);
#else
			/* On Windows, instr_time doesn't provide a timestamp anyway */
			fprintf(logfile, "%d %d %.0f %d 0 0\n",
					st->id, st->cnt, usec, st->use_file);
#endif
		}

		if (commands[st->state]->type == SQL_COMMAND)
		{
			/*
			 * Read and discard the query result; note this is not included in
			 * the statement latency numbers.
			 */
			res = PQgetResult(st->con);
			switch (PQresultStatus(res))
			{
				case PGRES_COMMAND_OK:
				case PGRES_TUPLES_OK:
					break;		/* OK */
				default:
					fprintf(stderr, "Client %d aborted in state %d: %s",
							st->id, st->state, PQerrorMessage(st->con));
					PQclear(res);
					return clientDone(st, false);
			}
			PQclear(res);
			discard_response(st);
		}

		if (commands[st->state + 1] == NULL)
		{
			if (is_connect)
			{
				PQfinish(st->con);
				st->con = NULL;
			}

			++st->cnt;
			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
				return clientDone(st, true);	/* exit success */
		}

		/* increment state counter */
		st->state++;
		if (commands[st->state] == NULL)
		{
			st->state = 0;
			st->use_file = getrand(thread, 0, num_files - 1);
			commands = sql_files[st->use_file];
		}
	}

	if (st->con == NULL)
	{
		instr_time	start,
					end;

		INSTR_TIME_SET_CURRENT(start);
		if ((st->con = doConnect()) == NULL)
		{
			fprintf(stderr, "Client %d aborted in establishing connection.\n", st->id);
			return clientDone(st, false);
		}
		INSTR_TIME_SET_CURRENT(end);
		INSTR_TIME_ACCUM_DIFF(*conn_time, end, start);
	}

	/* Record transaction start time if logging is enabled */
	if (logfile && st->state == 0)
		INSTR_TIME_SET_CURRENT(st->txn_begin);

	/* Record statement start time if per-command latencies are requested */
	if (is_latencies)
		INSTR_TIME_SET_CURRENT(st->stmt_begin);

	if (commands[st->state]->type == SQL_COMMAND)
	{
		const Command *command = commands[st->state];
		int			r;

		if (querymode == QUERY_SIMPLE)
		{
			char	   *sql;

			sql = xstrdup(command->argv[0]);
			sql = assignVariables(st, sql);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQuery(st->con, sql);
			free(sql);
		}
		else if (querymode == QUERY_EXTENDED)
		{
			const char *sql = command->argv[0];
			const char *params[MAX_ARGS];

			getQueryParams(st, command, params);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQueryParams(st->con, sql, command->argc - 1,
								  NULL, params, NULL, NULL, 0);
		}
		else if (querymode == QUERY_PREPARED)
		{
			char		name[MAX_PREPARE_NAME];
			const char *params[MAX_ARGS];

			if (!st->prepared[st->use_file])
			{
				int			j;

				for (j = 0; commands[j] != NULL; j++)
				{
					PGresult   *res;
					char		name[MAX_PREPARE_NAME];

					if (commands[j]->type != SQL_COMMAND)
						continue;
					preparedStatementName(name, st->use_file, j);
					res = PQprepare(st->con, name,
						  commands[j]->argv[0], commands[j]->argc - 1, NULL);
					if (PQresultStatus(res) != PGRES_COMMAND_OK)
						fprintf(stderr, "%s", PQerrorMessage(st->con));
					PQclear(res);
				}
				st->prepared[st->use_file] = true;
			}

			getQueryParams(st, command, params);
			preparedStatementName(name, st->use_file, st->state);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, name);
			r = PQsendQueryPrepared(st->con, name, command->argc - 1,
									params, NULL, NULL, 0);
		}
		else	/* unknown sql mode */
			r = 0;

		if (r == 0)
		{
			if (debug)
				fprintf(stderr, "client %d cannot send %s\n", st->id, command->argv[0]);
			st->ecnt++;
		}
		else
			st->listen = 1;		/* flag that we should listen for the result */
	}
	else if (commands[st->state]->type == META_COMMAND)
	{
		int			argc = commands[st->state]->argc,
					i;
		char	  **argv = commands[st->state]->argv;

		if (debug)
		{
			fprintf(stderr, "client %d executing \\%s", st->id, argv[0]);
			for (i = 1; i < argc; i++)
				fprintf(stderr, " %s", argv[i]);
			fprintf(stderr, "\n");
		}

		if (pg_strcasecmp(argv[0], "setrandom") == 0)
		{
			char	   *var;
			int			min,
						max;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				min = atoi(var);
			}
			else
				min = atoi(argv[2]);

#ifdef NOT_USED
			if (min < 0)
			{
				fprintf(stderr, "%s: invalid minimum number %d\n", argv[0], min);
				st->ecnt++;
				return true;
			}
#endif

			if (*argv[3] == ':')
			{
				if ((var = getVariable(st, argv[3] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
				max = atoi(var);
			}
			else
				max = atoi(argv[3]);

			if (max < min)
			{
				fprintf(stderr, "%s: maximum is less than minimum\n", argv[0]);
				st->ecnt++;
				return true;
			}

			/*
			 * getrand() needs to be able to subtract min from max and add
			 * one to the result without overflowing.  Since we know max >= min,
			 * we can detect overflow just by checking for a negative result.
			 * But we must check both that the subtraction doesn't overflow,
			 * and that adding one to the result doesn't overflow either.
			 */
			if (max - min < 0 || (max - min) + 1 < 0)
			{
				fprintf(stderr, "%s: range too large\n", argv[0]);
				st->ecnt++;
				return true;
			}

#ifdef DEBUG
			printf("min: %d max: %d random: %d\n", min, max, getrand(thread, min, max));
#endif
			snprintf(res, sizeof(res), "%d", getrand(thread, min, max));

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "set") == 0)
		{
			char	   *var;
			int			ope1,
						ope2;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				ope1 = atoi(var);
			}
			else
				ope1 = atoi(argv[2]);

			if (argc < 5)
				snprintf(res, sizeof(res), "%d", ope1);
			else
			{
				if (*argv[4] == ':')
				{
					if ((var = getVariable(st, argv[4] + 1)) == NULL)
					{
						fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[4]);
						st->ecnt++;
						return true;
					}
					ope2 = atoi(var);
				}
				else
					ope2 = atoi(argv[4]);

				if (strcmp(argv[3], "+") == 0)
					snprintf(res, sizeof(res), "%d", ope1 + ope2);
				else if (strcmp(argv[3], "-") == 0)
					snprintf(res, sizeof(res), "%d", ope1 - ope2);
				else if (strcmp(argv[3], "*") == 0)
					snprintf(res, sizeof(res), "%d", ope1 * ope2);
				else if (strcmp(argv[3], "/") == 0)
				{
					if (ope2 == 0)
					{
						fprintf(stderr, "%s: division by zero\n", argv[0]);
						st->ecnt++;
						return true;
					}
					snprintf(res, sizeof(res), "%d", ope1 / ope2);
				}
				else
				{
					fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
			}

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "sleep") == 0)
		{
			char	   *var;
			int			usec;
			instr_time	now;

			if (*argv[1] == ':')
			{
				if ((var = getVariable(st, argv[1] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[1]);
					st->ecnt++;
					return true;
				}
				usec = atoi(var);
			}
			else
				usec = atoi(argv[1]);

			if (argc > 2)
			{
				if (pg_strcasecmp(argv[2], "ms") == 0)
					usec *= 1000;
				else if (pg_strcasecmp(argv[2], "s") == 0)
					usec *= 1000000;
			}
			else
				usec *= 1000000;

			INSTR_TIME_SET_CURRENT(now);
			st->until = INSTR_TIME_GET_MICROSEC(now) + usec;
			st->sleeping = 1;

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "setshell") == 0)
		{
			bool		ret = runShellCommand(st, argv[1], argv + 2, argc - 2);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "shell") == 0)
		{
			bool		ret = runShellCommand(st, NULL, argv + 1, argc - 1);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		goto top;
	}

	return true;
}

/* discard connections */
static void
disconnect_all(CState *state, int length)
{
	int			i;

	for (i = 0; i < length; i++)
	{
		if (state[i].con)
		{
			PQfinish(state[i].con);
			state[i].con = NULL;
		}
	}
}

/* create tables and setup data */
static void
init(bool is_no_vacuum)
{
	/*
	 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
	 * fields in these table declarations were intended to comply with that.
	 * But because they default to NULLs, they don't actually take any space.
	 * We could fix that by giving them non-null default values. However, that
	 * would completely break comparability of pgbench results with prior
	 * versions.  Since pgbench has never pretended to be fully TPC-B
	 * compliant anyway, we stick with the historical behavior.
	 */
	struct ddlinfo
	{
		char	   *table;
		char	   *cols;
		int			declare_fillfactor;
	};
	struct ddlinfo DDLs[] = {
		{
			"pgbench_history",
			"tid int,bid int,aid int,delta int,mtime timestamp,filler char(22)",
			0
		},
		{
			"pgbench_tellers",
			"tid int not null,bid int,tbalance int,filler char(92),"
			"tbalance1 int, filler1 varchar(150),tbalance2 int,filler2 char(1550)",
			1
		},
		{
			"pgbench_accounts",
			"aid int not null,bid int,abalance int,filler char(92),"
			"abalance1 int,filler1 varchar(150),abalance2 int,filler2 char(1550)",
			1
		},
		{
			"pgbench_branches",
			"bid int not null,bbalance int,filler char(92),bbalance1 int,"
			"filler1 varchar(150), bbalance2 int, filler2 char(1550)",
			1
		}
	};
	static char *DDLAFTERs[] = {
		"alter table pgbench_branches add primary key (bid)",
		"alter table pgbench_tellers add primary key (tid)",
		"alter table pgbench_accounts add primary key (aid)"
	};
	static char *DDLKEYs[] = {
		"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
		"alter table pgbench_accounts add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (tid) references pgbench_tellers",
		"alter table pgbench_history add foreign key (aid) references pgbench_accounts"
	};

	PGconn	   *con;
	PGresult   *res;
	char		sql[256];
	int			i;

	if ((con = doConnect()) == NULL)
		exit(1);

	for (i = 0; i < lengthof(DDLs); i++)
	{
		char		opts[256];
		char		buffer[256];
		struct ddlinfo *ddl = &DDLs[i];

		/* Remove old table, if it exists. */
		snprintf(buffer, 256, "drop table if exists %s", ddl->table);
		executeStatement(con, buffer);

		/* Construct new create table statement. */
		opts[0] = '\0';
		if (ddl->declare_fillfactor)
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " with (fillfactor=%d)", fillfactor);
		if (tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, tablespace,
												   strlen(tablespace));
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}
		snprintf(buffer, 256, "create%s table %s(%s)%s",
				 unlogged_tables ? " unlogged" : "",
				 ddl->table, ddl->cols, opts);

		executeStatement(con, buffer);
	}

	executeStatement(con, "begin");

	for (i = 0; i < nbranches * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_branches values(%d,0,0,0,0,0,0)", i + 1);
		executeStatement(con, sql);
	}

	for (i = 0; i < ntellers * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_tellers values (%d,%d,0,0,0,0,0,0)",
				 i + 1, i / ntellers + 1);
		executeStatement(con, sql);
	}

	executeStatement(con, "commit");

	/*
	 * fill the pgbench_accounts table with some data
	 */
	fprintf(stderr, "creating tables...\n");

	executeStatement(con, "begin");
	executeStatement(con, "truncate pgbench_accounts");

	res = PQexec(con, "copy pgbench_accounts from stdin");
	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);

	for (i = 0; i < naccounts * scale; i++)
	{
		int			j = i + 1;

		snprintf(sql, 256, "%d\t%d\t%d\t \t%d\t \t%d\t \n", j, i / naccounts + 1, 0,0,0);
		if (PQputline(con, sql))
		{
			fprintf(stderr, "PQputline failed\n");
			exit(1);
		}

		if (j % 100000 == 0)
			fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
					j, naccounts * scale,
					j * 100 / (naccounts * scale));
	}
	if (PQputline(con, "\\.\n"))
	{
		fprintf(stderr, "very last PQputline failed\n");
		exit(1);
	}
	if (PQendcopy(con))
	{
		fprintf(stderr, "PQendcopy failed\n");
		exit(1);
	}
	executeStatement(con, "commit");

	/* vacuum */
	if (!is_no_vacuum)
	{
		fprintf(stderr, "vacuum...\n");
		executeStatement(con, "vacuum analyze pgbench_branches");
		executeStatement(con, "vacuum analyze pgbench_tellers");
		executeStatement(con, "vacuum analyze pgbench_accounts");
		executeStatement(con, "vacuum analyze pgbench_history");
	}

	/*
	 * create indexes
	 */
	fprintf(stderr, "set primary keys...\n");
	for (i = 0; i < lengthof(DDLAFTERs); i++)
	{
		char		buffer[256];

		strncpy(buffer, DDLAFTERs[i], 256);

		if (index_tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, index_tablespace,
												   strlen(index_tablespace));
			snprintf(buffer + strlen(buffer), 256 - strlen(buffer),
					 " using index tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}

		executeStatement(con, buffer);
	}

	/*
	 * create foreign keys
	 */
	if (foreign_keys)
	{
		fprintf(stderr, "set foreign keys...\n");
		for (i = 0; i < lengthof(DDLKEYs); i++)
		{
			executeStatement(con, DDLKEYs[i]);
		}
	}


	fprintf(stderr, "done.\n");
	PQfinish(con);
}

/*
 * Parse the raw SQL and replace each :param with $n.
 */
static bool
parseQuery(Command *cmd, const char *raw_sql)
{
	char	   *sql,
			   *p;

	sql = xstrdup(raw_sql);
	cmd->argc = 1;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		char		var[12];
		char	   *name;
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		if (cmd->argc >= MAX_ARGS)
		{
			fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n", MAX_ARGS - 1, raw_sql);
			return false;
		}

		sprintf(var, "$%d", cmd->argc);
		p = replaceVariable(&sql, p, eaten, var);

		cmd->argv[cmd->argc] = name;
		cmd->argc++;
	}

	cmd->argv[0] = sql;
	return true;
}

/* Parse a command; return a Command struct, or NULL if it's a comment */
static Command *
process_commands(char *buf)
{
	const char	delim[] = " \f\n\r\t\v";

	Command    *my_commands;
	int			j;
	char	   *p,
			   *tok;

	/* Make the string buf end at the next newline */
	if ((p = strchr(buf, '\n')) != NULL)
		*p = '\0';

	/* Skip leading whitespace */
	p = buf;
	while (isspace((unsigned char) *p))
		p++;

	/* If the line is empty or actually a comment, we're done */
	if (*p == '\0' || strncmp(p, "--", 2) == 0)
		return NULL;

	/* Allocate and initialize Command structure */
	my_commands = (Command *) xmalloc(sizeof(Command));
	my_commands->line = xstrdup(buf);
	my_commands->command_num = num_commands++;
	my_commands->type = 0;		/* until set */
	my_commands->argc = 0;

	if (*p == '\\')
	{
		my_commands->type = META_COMMAND;

		j = 0;
		tok = strtok(++p, delim);

		while (tok != NULL)
		{
			my_commands->argv[j++] = xstrdup(tok);
			my_commands->argc++;
			tok = strtok(NULL, delim);
		}

		if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
		{
			if (my_commands->argc < 4)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = 4; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = my_commands->argc < 5 ? 3 : 5; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "sleep") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			/*
			 * Split argument into number and unit to allow "sleep 1ms" etc.
			 * We don't have to terminate the number argument with null
			 * because it will be parsed with atoi, which ignores trailing
			 * non-digit characters.
			 */
			if (my_commands->argv[1][0] != ':')
			{
				char	   *c = my_commands->argv[1];

				while (isdigit((unsigned char) *c))
					c++;
				if (*c)
				{
					my_commands->argv[2] = c;
					if (my_commands->argc < 3)
						my_commands->argc = 3;
				}
			}

			if (my_commands->argc >= 3)
			{
				if (pg_strcasecmp(my_commands->argv[2], "us") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "ms") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "s") != 0)
				{
					fprintf(stderr, "%s: unknown time unit '%s' - must be us, ms or s\n",
							my_commands->argv[0], my_commands->argv[2]);
					exit(1);
				}
			}

			for (j = 3; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "setshell") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else if (pg_strcasecmp(my_commands->argv[0], "shell") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing command\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else
		{
			fprintf(stderr, "Invalid command %s\n", my_commands->argv[0]);
			exit(1);
		}
	}
	else
	{
		my_commands->type = SQL_COMMAND;

		switch (querymode)
		{
			case QUERY_SIMPLE:
				my_commands->argv[0] = xstrdup(p);
				my_commands->argc++;
				break;
			case QUERY_EXTENDED:
			case QUERY_PREPARED:
				if (!parseQuery(my_commands, p))
					exit(1);
				break;
			default:
				exit(1);
		}
	}

	return my_commands;
}

static int
process_file(char *filename)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	FILE	   *fd;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	if (num_files >= MAX_FILES)
	{
		fprintf(stderr, "At most %d SQL files are allowed\n", MAX_FILES);
		exit(1);
	}

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	if (strcmp(filename, "-") == 0)
		fd = stdin;
	else if ((fd = fopen(filename, "r")) == NULL)
	{
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		return false;
	}

	lineno = 0;

	while (fgets(buf, sizeof(buf), fd) != NULL)
	{
		Command    *command;

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}
	fclose(fd);

	my_commands[lineno] = NULL;

	sql_files[num_files++] = my_commands;

	return true;
}

static Command **
process_builtin(char *tb)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	lineno = 0;

	for (;;)
	{
		char	   *p;
		Command    *command;

		p = buf;
		while (*tb && *tb != '\n')
			*p++ = *tb++;

		if (*tb == '\0')
			break;

		if (*tb == '\n')
			tb++;

		*p = '\0';

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}

	my_commands[lineno] = NULL;

	return my_commands;
}

/* print out results */
static void
printResults(int ttype, int normal_xacts, int nclients,
			 TState *threads, int nthreads,
			 instr_time total_time, instr_time conn_total_time)
{
	double		time_include,
				tps_include,
				tps_exclude;
	char	   *s;

	time_include = INSTR_TIME_GET_DOUBLE(total_time);
	tps_include = normal_xacts / time_include;
	tps_exclude = normal_xacts / (time_include -
						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));

	if (ttype == 0)
		s = "TPC-B (sort of)";
	else if (ttype == 2)
		s = "Update only pgbench_accounts";
	else if (ttype == 1)
		s = "SELECT only";
	else
		s = "Custom query";

	printf("transaction type: %s\n", s);
	printf("scaling factor: %d\n", scale);
	printf("query mode: %s\n", QUERYMODE[querymode]);
	printf("number of clients: %d\n", nclients);
	printf("number of threads: %d\n", nthreads);
	if (duration <= 0)
	{
		printf("number of transactions per client: %d\n", nxacts);
		printf("number of transactions actually processed: %d/%d\n",
			   normal_xacts, nxacts * nclients);
	}
	else
	{
		printf("duration: %d s\n", duration);
		printf("number of transactions actually processed: %d\n",
			   normal_xacts);
	}
	printf("tps = %f (including connections establishing)\n", tps_include);
	printf("tps = %f (excluding connections establishing)\n", tps_exclude);

	/* Report per-command latencies */
	if (is_latencies)
	{
		int			i;

		for (i = 0; i < num_files; i++)
		{
			Command   **commands;

			if (num_files > 1)
				printf("statement latencies in milliseconds, file %d:\n", i + 1);
			else
				printf("statement latencies in milliseconds:\n");

			for (commands = sql_files[i]; *commands != NULL; commands++)
			{
				Command    *command = *commands;
				int			cnum = command->command_num;
				double		total_time;
				instr_time	total_exec_elapsed;
				int			total_exec_count;
				int			t;

				/* Accumulate per-thread data for command */
				INSTR_TIME_SET_ZERO(total_exec_elapsed);
				total_exec_count = 0;
				for (t = 0; t < nthreads; t++)
				{
					TState	   *thread = &threads[t];

					INSTR_TIME_ADD(total_exec_elapsed,
								   thread->exec_elapsed[cnum]);
					total_exec_count += thread->exec_count[cnum];
				}

				if (total_exec_count > 0)
					total_time = INSTR_TIME_GET_MILLISEC(total_exec_elapsed) / (double) total_exec_count;
				else
					total_time = 0.0;

				printf("\t%f\t%s\n", total_time, command->line);
			}
		}
	}
}


int
main(int argc, char **argv)
{
	int			c;
	int			nclients = 1;	/* default number of simulated clients */
	int			nthreads = 1;	/* default number of threads */
	int			is_init_mode = 0;		/* initialize mode? */
	int			is_no_vacuum = 0;		/* no vacuum at all before testing? */
	int			do_vacuum_accounts = 0; /* do vacuum accounts before testing? */
	int			ttype = 0;		/* transaction type. 0: TPC-B, 1: SELECT only,
								 * 2: skip update of branches and tellers */
	int			optindex;
	char	   *filename = NULL;
	bool		scale_given = false;

	CState	   *state;			/* status of clients */
	TState	   *threads;		/* array of threads */

	instr_time	start_time;		/* start up time */
	instr_time	total_time;
	instr_time	conn_total_time;
	int			total_xacts;

	int			i;

	static struct option long_options[] = {
		{"foreign-keys", no_argument, &foreign_keys, 1},
		{"index-tablespace", required_argument, NULL, 3},
		{"tablespace", required_argument, NULL, 2},
		{"unlogged-tables", no_argument, &unlogged_tables, 1},
		{NULL, 0, NULL, 0}
	};

#ifdef HAVE_GETRLIMIT
	struct rlimit rlim;
#endif

	PGconn	   *con;
	PGresult   *res;
	char	   *env;

	char		val[64];

	progname = get_progname(argv[0]);

	if (argc > 1)
	{
		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
		{
			usage();
			exit(0);
		}
		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
		{
			puts("pgbench (PostgreSQL) " PG_VERSION);
			exit(0);
		}
	}

#ifdef WIN32
	/* stderr is buffered on Win32. */
	setvbuf(stderr, NULL, _IONBF, 0);
#endif

	if ((env = getenv("PGHOST")) != NULL && *env != '\0')
		pghost = env;
	if ((env = getenv("PGPORT")) != NULL && *env != '\0')
		pgport = env;
	if ((env = getenv("PGUSER")) != NULL && *env != '\0')
		login = env;

	state = (CState *) xmalloc(sizeof(CState));
	memset(state, 0, sizeof(CState));

	while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
	{
		switch (c)
		{
			case 'i':
				is_init_mode++;
				break;
			case 'h':
				pghost = optarg;
				break;
			case 'n':
				is_no_vacuum++;
				break;
			case 'v':
				do_vacuum_accounts++;
				break;
			case 'p':
				pgport = optarg;
				break;
			case 'd':
				debug++;
				break;
			case 'S':
				ttype = 1;
				break;
			case 'N':
				ttype = 2;
				break;
			case 'c':
				nclients = atoi(optarg);
				if (nclients <= 0 || nclients > MAXCLIENTS)
				{
					fprintf(stderr, "invalid number of clients: %d\n", nclients);
					exit(1);
				}
#ifdef HAVE_GETRLIMIT
#ifdef RLIMIT_NOFILE			/* most platforms use RLIMIT_NOFILE */
				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
#else							/* but BSD doesn't ... */
				if (getrlimit(RLIMIT_OFILE, &rlim) == -1)
#endif   /* RLIMIT_NOFILE */
				{
					fprintf(stderr, "getrlimit failed: %s\n", strerror(errno));
					exit(1);
				}
				if (rlim.rlim_cur <= (nclients + 2))
				{
					fprintf(stderr, "You need at least %d open files but you are only allowed to use %ld.\n", nclients + 2, (long) rlim.rlim_cur);
					fprintf(stderr, "Use limit/ulimit to increase the limit before using pgbench.\n");
					exit(1);
				}
#endif   /* HAVE_GETRLIMIT */
				break;
			case 'j':			/* jobs */
				nthreads = atoi(optarg);
				if (nthreads <= 0)
				{
					fprintf(stderr, "invalid number of threads: %d\n", nthreads);
					exit(1);
				}
				break;
			case 'C':
				is_connect = true;
				break;
			case 'r':
				is_latencies = true;
				break;
			case 's':
				scale_given = true;
				scale = atoi(optarg);
				if (scale <= 0)
				{
					fprintf(stderr, "invalid scaling factor: %d\n", scale);
					exit(1);
				}
				break;
			case 't':
				if (duration > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				nxacts = atoi(optarg);
				if (nxacts <= 0)
				{
					fprintf(stderr, "invalid number of transactions: %d\n", nxacts);
					exit(1);
				}
				break;
			case 'T':
				if (nxacts > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				duration = atoi(optarg);
				if (duration <= 0)
				{
					fprintf(stderr, "invalid duration: %d\n", duration);
					exit(1);
				}
				break;
			case 'U':
				login = optarg;
				break;
			case 'l':
				use_log = true;
				break;
			case 'f':
				ttype = 3;
				filename = optarg;
				if (process_file(filename) == false || *sql_files[num_files - 1] == NULL)
					exit(1);
				break;
			case 'D':
				{
					char	   *p;

					if ((p = strchr(optarg, '=')) == NULL || p == optarg || *(p + 1) == '\0')
					{
						fprintf(stderr, "invalid variable definition: %s\n", optarg);
						exit(1);
					}

					*p++ = '\0';
					if (!putVariable(&state[0], "option", optarg, p))
						exit(1);
				}
				break;
			case 'F':
				fillfactor = atoi(optarg);
				if ((fillfactor < 10) || (fillfactor > 100))
				{
					fprintf(stderr, "invalid fillfactor: %d\n", fillfactor);
					exit(1);
				}
				break;
			case 'M':
				if (num_files > 0)
				{
					fprintf(stderr, "query mode (-M) should be specified before transaction scripts (-f)\n");
					exit(1);
				}
				for (querymode = 0; querymode < NUM_QUERYMODE; querymode++)
					if (strcmp(optarg, QUERYMODE[querymode]) == 0)
						break;
				if (querymode >= NUM_QUERYMODE)
				{
					fprintf(stderr, "invalid query mode (-M): %s\n", optarg);
					exit(1);
				}
				break;
			case 0:
				/* This covers long options which take no argument. */
				break;
			case 2:				/* tablespace */
				tablespace = optarg;
				break;
			case 3:				/* index-tablespace */
				index_tablespace = optarg;
				break;
			default:
				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
				exit(1);
				break;
		}
	}

	if (argc > optind)
		dbName = argv[optind];
	else
	{
		if ((env = getenv("PGDATABASE")) != NULL && *env != '\0')
			dbName = env;
		else if (login != NULL && *login != '\0')
			dbName = login;
		else
			dbName = "";
	}

	if (is_init_mode)
	{
		init(is_no_vacuum);
		exit(0);
	}

	/* Use DEFAULT_NXACTS if neither nxacts nor duration is specified. */
	if (nxacts <= 0 && duration <= 0)
		nxacts = DEFAULT_NXACTS;

	if (nclients % nthreads != 0)
	{
		fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n", nclients, nthreads);
		exit(1);
	}

	/*
	 * is_latencies only works with multiple threads in thread-based
	 * implementations, not fork-based ones, because it supposes that the
	 * parent can see changes made to the per-thread execution stats by child
	 * threads.  It seems useful enough to accept despite this limitation, but
	 * perhaps we should FIXME someday (by passing the stats data back up
	 * through the parent-to-child pipes).
	 */
#ifndef ENABLE_THREAD_SAFETY
	if (is_latencies && nthreads > 1)
	{
		fprintf(stderr, "-r does not work with -j larger than 1 on this platform.\n");
		exit(1);
	}
#endif

	/*
	 * Save the main process ID in a global variable, because the process ID
	 * changes after fork.
	 */
	main_pid = (int) getpid();

	if (nclients > 1)
	{
		state = (CState *) xrealloc(state, sizeof(CState) * nclients);
		memset(state + 1, 0, sizeof(CState) * (nclients - 1));

		/* copy any -D switch values to all clients */
		for (i = 1; i < nclients; i++)
		{
			int			j;

			state[i].id = i;
			for (j = 0; j < state[0].nvariables; j++)
			{
				if (!putVariable(&state[i], "startup", state[0].variables[j].name, state[0].variables[j].value))
					exit(1);
			}
		}
	}

	if (debug)
	{
		if (duration <= 0)
			printf("pghost: %s pgport: %s nclients: %d nxacts: %d dbName: %s\n",
				   pghost, pgport, nclients, nxacts, dbName);
		else
			printf("pghost: %s pgport: %s nclients: %d duration: %d dbName: %s\n",
				   pghost, pgport, nclients, duration, dbName);
	}

	/* opening connection... */
	con = doConnect();
	if (con == NULL)
		exit(1);

	if (PQstatus(con) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database '%s' failed.\n", dbName);
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}

	if (ttype != 3)
	{
		/*
		 * If this is not a custom query, get the scaling factor, which
		 * should be the same as count(*) from pgbench_branches.
		 */
		res = PQexec(con, "select count(*) from pgbench_branches");
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(con));
			exit(1);
		}
		scale = atoi(PQgetvalue(res, 0, 0));
		if (scale < 0)
		{
			fprintf(stderr, "count(*) from pgbench_branches invalid (%d)\n", scale);
			exit(1);
		}
		PQclear(res);

		/* warn if we override user-given -s switch */
		if (scale_given)
			fprintf(stderr,
			"Scale option ignored, using pgbench_branches table count = %d\n",
					scale);
	}

	/*
	 * :scale variables normally get -s or database scale, but don't override
	 * an explicit -D switch
	 */
	if (getVariable(&state[0], "scale") == NULL)
	{
		snprintf(val, sizeof(val), "%d", scale);
		for (i = 0; i < nclients; i++)
		{
			if (!putVariable(&state[i], "startup", "scale", val))
				exit(1);
		}
	}

	if (!is_no_vacuum)
	{
		fprintf(stderr, "starting vacuum...");
		executeStatement(con, "vacuum pgbench_branches");
		executeStatement(con, "vacuum pgbench_tellers");
		executeStatement(con, "truncate pgbench_history");
		fprintf(stderr, "end.\n");

		if (do_vacuum_accounts)
		{
			fprintf(stderr, "starting vacuum pgbench_accounts...");
			executeStatement(con, "vacuum analyze pgbench_accounts");
			fprintf(stderr, "end.\n");
		}
	}
	PQfinish(con);

	/* set random seed */
	INSTR_TIME_SET_CURRENT(start_time);
	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));

	/* process builtin SQL scripts */
	switch (ttype)
	{
		case 0:
			sql_files[0] = process_builtin(tpc_b);
			num_files = 1;
			break;

		case 1:
			sql_files[0] = process_builtin(select_only);
			num_files = 1;
			break;

		case 2:
			sql_files[0] = process_builtin(simple_update);
			num_files = 1;
			break;

		default:
			break;
	}

	/* set up thread data structures */
	threads = (TState *) xmalloc(sizeof(TState) * nthreads);
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		thread->tid = i;
		thread->state = &state[nclients / nthreads * i];
		thread->nstate = nclients / nthreads;
		thread->random_state[0] = random();
		thread->random_state[1] = random();
		thread->random_state[2] = random();

		if (is_latencies)
		{
			/* Reserve memory for the thread to store per-command latencies */
			int			t;

			thread->exec_elapsed = (instr_time *)
				xmalloc(sizeof(instr_time) * num_commands);
			thread->exec_count = (int *)
				xmalloc(sizeof(int) * num_commands);

			for (t = 0; t < num_commands; t++)
			{
				INSTR_TIME_SET_ZERO(thread->exec_elapsed[t]);
				thread->exec_count[t] = 0;
			}
		}
		else
		{
			thread->exec_elapsed = NULL;
			thread->exec_count = NULL;
		}
	}

	/* get start up time */
	INSTR_TIME_SET_CURRENT(start_time);

	/* set alarm if duration is specified. */
	if (duration > 0)
		setalarm(duration);

	/* start threads */
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		INSTR_TIME_SET_CURRENT(thread->start_time);

		/* the first thread (i = 0) is executed by main thread */
		if (i > 0)
		{
			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);

			if (err != 0 || thread->thread == INVALID_THREAD)
			{
				fprintf(stderr, "cannot create thread: %s\n", strerror(err));
				exit(1);
			}
		}
		else
		{
			thread->thread = INVALID_THREAD;
		}
	}

	/* wait for threads and accumulate results */
	total_xacts = 0;
	INSTR_TIME_SET_ZERO(conn_total_time);
	for (i = 0; i < nthreads; i++)
	{
		void	   *ret = NULL;

		if (threads[i].thread == INVALID_THREAD)
			ret = threadRun(&threads[i]);
		else
			pthread_join(threads[i].thread, &ret);

		if (ret != NULL)
		{
			TResult    *r = (TResult *) ret;

			total_xacts += r->xacts;
			INSTR_TIME_ADD(conn_total_time, r->conn_time);
			free(ret);
		}
	}
	disconnect_all(state, nclients);

	/* get end time */
	INSTR_TIME_SET_CURRENT(total_time);
	INSTR_TIME_SUBTRACT(total_time, start_time);
	printResults(ttype, total_xacts, nclients, threads, nthreads,
				 total_time, conn_total_time);

	return 0;
}

static void *
threadRun(void *arg)
{
	TState	   *thread = (TState *) arg;
	CState	   *state = thread->state;
	TResult    *result;
	FILE	   *logfile = NULL; /* per-thread log file */
	instr_time	start,
				end;
	int			nstate = thread->nstate;
	int			remains = nstate;		/* number of remaining clients */
	int			i;

	result = xmalloc(sizeof(TResult));
	INSTR_TIME_SET_ZERO(result->conn_time);

	/* open log file if requested */
	if (use_log)
	{
		char		logpath[64];

		if (thread->tid == 0)
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d", main_pid);
		else
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d.%d", main_pid, thread->tid);
		logfile = fopen(logpath, "w");

		if (logfile == NULL)
		{
			fprintf(stderr, "Couldn't open logfile \"%s\": %s\n", logpath, strerror(errno));
			goto done;
		}
	}

	if (!is_connect)
	{
		/* make connections to the database */
		for (i = 0; i < nstate; i++)
		{
			if ((state[i].con = doConnect()) == NULL)
				goto done;
		}
	}

	/* time after thread and connections set up */
	INSTR_TIME_SET_CURRENT(result->conn_time);
	INSTR_TIME_SUBTRACT(result->conn_time, thread->start_time);

	/* send start up queries in async manner */
	for (i = 0; i < nstate; i++)
	{
		CState	   *st = &state[i];
		Command   **commands;
		int			prev_ecnt = st->ecnt;

		/* pick the script first, so that commands[] matches st->use_file */
		st->use_file = getrand(thread, 0, num_files - 1);
		commands = sql_files[st->use_file];
		if (!doCustom(thread, st, &result->conn_time, logfile))
			remains--;			/* I've aborted */

		if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
		{
			fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
			remains--;			/* I've aborted */
			PQfinish(st->con);
			st->con = NULL;
		}
	}

	while (remains > 0)
	{
		fd_set		input_mask;
		int			maxsock;	/* highest socket number to wait on */
		int64		now_usec = 0;
		int64		min_usec;

		FD_ZERO(&input_mask);

		maxsock = -1;
		min_usec = INT64_MAX;
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			sock;

			if (st->sleeping)
			{
				int64		this_usec;	/* int64: st->until is in usec */

				if (min_usec == INT64_MAX)
				{
					instr_time	now;

					INSTR_TIME_SET_CURRENT(now);
					now_usec = INSTR_TIME_GET_MICROSEC(now);
				}

				this_usec = st->until - now_usec;
				if (min_usec > this_usec)
					min_usec = this_usec;
			}
			else if (st->con == NULL)
			{
				continue;
			}
			else if (commands[st->state]->type == META_COMMAND)
			{
				min_usec = 0;	/* the connection is ready to run */
				break;
			}

			sock = PQsocket(st->con);
			if (sock < 0)
			{
				fprintf(stderr, "bad socket: %s\n", strerror(errno));
				goto done;
			}

			FD_SET(sock, &input_mask);

			if (maxsock < sock)
				maxsock = sock;
		}

		if (min_usec > 0 && maxsock != -1)
		{
			int			nsocks; /* return from select(2) */

			if (min_usec != INT64_MAX)
			{
				struct timeval timeout;

				timeout.tv_sec = min_usec / 1000000;
				timeout.tv_usec = min_usec % 1000000;
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout);
			}
			else
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
			if (nsocks < 0)
			{
				if (errno == EINTR)
					continue;
				/* must be something wrong */
				fprintf(stderr, "select failed: %s\n", strerror(errno));
				goto done;
			}
		}

		/* ok, backend returns reply */
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			prev_ecnt = st->ecnt;

			if (st->con && (FD_ISSET(PQsocket(st->con), &input_mask)
							|| commands[st->state]->type == META_COMMAND))
			{
				if (!doCustom(thread, st, &result->conn_time, logfile))
					remains--;	/* I've aborted */
			}

			if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
			{
				fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
				remains--;		/* I've aborted */
				PQfinish(st->con);
				st->con = NULL;
			}
		}
	}

done:
	INSTR_TIME_SET_CURRENT(start);
	disconnect_all(state, nstate);
	result->xacts = 0;
	for (i = 0; i < nstate; i++)
		result->xacts += state[i].cnt;
	INSTR_TIME_SET_CURRENT(end);
	INSTR_TIME_ACCUM_DIFF(result->conn_time, end, start);
	if (logfile)
		fclose(logfile);
	return result;
}


/*
 * Support for duration option: set timer_exceeded after so many seconds.
 */

#ifndef WIN32

static void
handle_sig_alarm(SIGNAL_ARGS)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	pqsignal(SIGALRM, handle_sig_alarm);
	alarm(seconds);
}

#ifndef ENABLE_THREAD_SAFETY

/*
 * implements pthread using fork.
 */

typedef struct fork_pthread
{
	pid_t		pid;
	int			pipes[2];
}	fork_pthread;

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	fork_pthread *th;
	void	   *ret;

	th = (fork_pthread *) xmalloc(sizeof(fork_pthread));
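	if (pipe(th->pipes) < 0)
	{
		int			save_errno = errno;

		free(th);
		return save_errno;
	}

	th->pid = fork();
	if (th->pid == -1)			/* error */
	{
		int			save_errno = errno;

		/* don't leak the pipe descriptors when fork() fails */
		close(th->pipes[0]);
		close(th->pipes[1]);
		free(th);
		return save_errno;
	}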
	if (th->pid != 0)			/* in parent process */
	{
		close(th->pipes[1]);
		*thread = th;
		return 0;
	}

	/* in child process */
	close(th->pipes[0]);

	/* set alarm again because the child does not inherit timers */
	if (duration > 0)
		setalarm(duration);

	ret = start_routine(arg);
	write(th->pipes[1], ret, sizeof(TResult));
	close(th->pipes[1]);
	free(th);
	exit(0);
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	int			status;

	while (waitpid(th->pid, &status, 0) != th->pid)
	{
		if (errno != EINTR)
			return errno;
	}

	if (thread_return != NULL)
	{
		/* assume result is TResult */
		*thread_return = xmalloc(sizeof(TResult));
		if (read(th->pipes[0], *thread_return, sizeof(TResult)) != sizeof(TResult))
		{
			free(*thread_return);
			*thread_return = NULL;
		}
	}
	close(th->pipes[0]);

	free(th);
	return 0;
}
#endif
#else							/* WIN32 */

static VOID CALLBACK
win32_timer_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	HANDLE		queue;
	HANDLE		timer;

	/* This function will be called at most once, so we can cheat a bit. */
	queue = CreateTimerQueue();
	if (seconds > ((DWORD) -1) / 1000 ||
		!CreateTimerQueueTimer(&timer, queue,
							   win32_timer_callback, NULL, seconds * 1000, 0,
							   WT_EXECUTEINTIMERTHREAD | WT_EXECUTEONLYONCE))
	{
		fprintf(stderr, "Failed to set timer\n");
		exit(1);
	}
}

/* partial pthread implementation for Windows */

typedef struct win32_pthread
{
	HANDLE		handle;
	void	   *(*routine) (void *);
	void	   *arg;
	void	   *result;
} win32_pthread;

static unsigned __stdcall
win32_pthread_run(void *arg)
{
	win32_pthread *th = (win32_pthread *) arg;

	th->result = th->routine(th->arg);

	return 0;
}

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	int			save_errno;
	win32_pthread *th;

	th = (win32_pthread *) xmalloc(sizeof(win32_pthread));
	th->routine = start_routine;
	th->arg = arg;
	th->result = NULL;

	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
	if (th->handle == NULL)
	{
		save_errno = errno;
		free(th);
		return save_errno;
	}

	*thread = th;
	return 0;
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	if (th == NULL || th->handle == NULL)
		return errno = EINVAL;

	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
	{
		_dosmaperr(GetLastError());
		return errno;
	}

	if (thread_return)
		*thread_return = th->result;

	CloseHandle(th->handle);
	free(th);
	return 0;
}

#endif   /* WIN32 */
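
For readers following the non-threaded build above: the fork-based pthread emulation boils down to running the worker in a child process and shipping one fixed-size result struct back through a pipe. Below is a minimal standalone sketch of that idea, not the pgbench code itself. SketchResult, SketchWorker, sketch_spawn, sketch_join and sketch_count are hypothetical names introduced only for illustration, and the sketch assumes a POSIX system.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical fixed-size result record; stands in for pgbench's TResult. */
typedef struct
{
	int			xacts;
} SketchResult;

/* Hypothetical handle for one forked worker. */
typedef struct
{
	pid_t		pid;
	int			read_fd;
} SketchWorker;

/* Fork a child that runs fn(arg) and writes its result into a pipe. */
static int
sketch_spawn(SketchWorker *w, SketchResult (*fn) (int), int arg)
{
	int			fds[2];

	assert(w != NULL && fn != NULL);
	if (pipe(fds) < 0)
		return -1;

	w->pid = fork();
	if (w->pid < 0)
	{
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (w->pid == 0)
	{
		/* child: compute, send the struct back, then leave via _exit() */
		SketchResult r = fn(arg);
		ssize_t		n;

		close(fds[0]);
		n = write(fds[1], &r, sizeof(r));
		close(fds[1]);
		_exit(n == (ssize_t) sizeof(r) ? 0 : 1);
	}

	/* parent: keep only the read end of the pipe */
	close(fds[1]);
	w->read_fd = fds[0];
	return 0;
}

/* Read the child's result and reap it; returns 0 on success, -1 on error. */
static int
sketch_join(SketchWorker *w, SketchResult *out)
{
	int			status;
	ssize_t		n;

	assert(w != NULL && out != NULL);
	n = read(w->read_fd, out, sizeof(*out));
	close(w->read_fd);

	while (waitpid(w->pid, &status, 0) < 0)
	{
		if (errno != EINTR)
			return -1;
	}

	if (n != (ssize_t) sizeof(*out))
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;
	return 0;
}

/* Hypothetical worker: pretends to have run 2 * n transactions. */
static SketchResult
sketch_count(int n)
{
	SketchResult r;

	r.xacts = n * 2;
	return r;
}
```

Calling sketch_spawn(&w, sketch_count, 21) followed by sketch_join(&w, &r) leaves r.xacts at 42 in the parent. As in the attachment, this only works because the result is a flat struct with no pointers; anything heap-allocated in the child would not survive the trip through the pipe.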

Attachment: pgbench_random.c (application/octet-stream)

/*
 * pgbench.c
 *
 * A simple benchmark program for PostgreSQL
 * Originally written by Tatsuo Ishii and enhanced by many contributors.
 *
 * contrib/pgbench/pgbench.c
 * Copyright (c) 2000-2012, PostgreSQL Global Development Group
 * ALL RIGHTS RESERVED;
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose, without fee, and without a written agreement
 * is hereby granted, provided that the above copyright notice and this
 * paragraph and the following two paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
 * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIMS ANY WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
 * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
 * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 *
 */

#ifdef WIN32
#define FD_SETSIZE 1024			/* set before winsock2.h is included */
#endif   /* WIN32 */

#include "postgres_fe.h"

#include "getopt_long.h"
#include "libpq-fe.h"
#include "libpq/pqsignal.h"
#include "portability/instr_time.h"

#include <ctype.h>

#ifndef WIN32
#include <sys/time.h>
#include <unistd.h>
#endif   /* ! WIN32 */

#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

#ifdef HAVE_SYS_RESOURCE_H
#include <sys/resource.h>		/* for getrlimit */
#endif

#ifndef INT64_MAX
#define INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
#endif

/*
 * Multi-platform pthread implementations
 */

#ifdef WIN32
/* Use native win32 threads on Windows */
typedef struct win32_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#elif defined(ENABLE_THREAD_SAFETY)
/* Use platform-dependent pthread capability */
#include <pthread.h>
#else
/* Use emulation with fork. Rename pthread identifiers to avoid conflicts */

#include <sys/wait.h>

#define pthread_t				pg_pthread_t
#define pthread_attr_t			pg_pthread_attr_t
#define pthread_create			pg_pthread_create
#define pthread_join			pg_pthread_join

typedef struct fork_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#endif

extern char *optarg;
extern int	optind;


/********************************************************************
 * some configurable parameters */

/* max number of clients allowed */
#ifdef FD_SETSIZE
#define MAXCLIENTS	(FD_SETSIZE - 10)
#else
#define MAXCLIENTS	1024
#endif

#define DEFAULT_NXACTS	10		/* default nxacts */

int			nxacts = 0;			/* number of transactions per client */
int			duration = 0;		/* duration in seconds */

/*
 * scaling factor. for example, scale = 10 will make 1000000 tuples in
 * pgbench_accounts table.
 */
int			scale = 1;

/*
 * fillfactor. for example, fillfactor = 90 will use only 90 percent
 * space during inserts and leave 10 percent free.
 */
int			fillfactor = 100;

/*
 * create foreign key constraints on the tables?
 */
int			foreign_keys = 0;

/*
 * use unlogged tables?
 */
int			unlogged_tables = 0;

/*
 * tablespace selection
 */
char	   *tablespace = NULL;
char	   *index_tablespace = NULL;

/*
 * end of configurable parameters
 *********************************************************************/

#define nbranches	1			/* Makes little sense to change this.  Change
								 * -s instead */
#define ntellers	10
#define naccounts	100000

bool		use_log;			/* log transaction latencies to a file */
bool		is_connect;			/* establish connection for each transaction */
bool		is_latencies;		/* report per-command latencies */
int			main_pid;			/* main process id used in log filename */

char	   *pghost = "";
char	   *pgport = "";
char	   *login = NULL;
char	   *dbName;
const char *progname;

volatile bool timer_exceeded = false;	/* flag from signal handler */

/* variable definitions */
typedef struct
{
	char	   *name;			/* variable name */
	char	   *value;			/* its value */
} Variable;

#define MAX_FILES		128		/* max number of SQL script files allowed */
#define SHELL_COMMAND_SIZE	256 /* maximum size allowed for shell command */

/*
 * structures used in custom query mode
 */

typedef struct
{
	PGconn	   *con;			/* connection handle to DB */
	int			id;				/* client No. */
	int			state;			/* state No. */
	int			cnt;			/* xacts count */
	int			ecnt;			/* error count */
	int			listen;			/* 1 indicates that an async query has been
								 * sent */
	int			sleeping;		/* 1 indicates that the client is napping */
	int64		until;			/* napping until (usec) */
	Variable   *variables;		/* array of variable definitions */
	int			nvariables;
	instr_time	txn_begin;		/* used for measuring transaction latencies */
	instr_time	stmt_begin;		/* used for measuring statement latencies */
	int			use_file;		/* index in sql_files for this client */
	bool		prepared[MAX_FILES];
} CState;

/*
 * Thread state and result
 */
typedef struct
{
	int			tid;			/* thread id */
	pthread_t	thread;			/* thread handle */
	CState	   *state;			/* array of CState */
	int			nstate;			/* length of state[] */
	instr_time	start_time;		/* thread start time */
	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
	int		   *exec_count;		/* number of cmd executions (per Command) */
	unsigned short random_state[3];		/* separate randomness for each thread */
} TState;

#define INVALID_THREAD		((pthread_t) 0)

typedef struct
{
	instr_time	conn_time;
	int			xacts;
} TResult;

/*
 * queries read from files
 */
#define SQL_COMMAND		1
#define META_COMMAND	2
#define MAX_ARGS		10

typedef enum QueryMode
{
	QUERY_SIMPLE,				/* simple query */
	QUERY_EXTENDED,				/* extended query */
	QUERY_PREPARED,				/* extended query with prepared statements */
	NUM_QUERYMODE
} QueryMode;

static QueryMode querymode = QUERY_SIMPLE;
static const char *QUERYMODE[] = {"simple", "extended", "prepared"};

typedef struct
{
	char	   *line;			/* full text of command line */
	int			command_num;	/* unique index of this Command struct */
	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
	int			argc;			/* number of command words */
	char	   *argv[MAX_ARGS]; /* command word list */
} Command;

static Command **sql_files[MAX_FILES];	/* SQL script files */
static int	num_files;			/* number of script files */
static int	num_commands = 0;	/* total number of Command structs */
static int	debug = 0;			/* debug flag */

/* default scenario */
static char *tpc_b = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta,"
	"filler = random()::text,"
	" filler1 = random()::text WHERE aid = :aid;\n"
	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta,"
	"filler = random()::text ,"
	" filler1 = random()::text WHERE tid = :tid;\n"
	"UPDATE pgbench_branches SET bbalance = bbalance + :delta,"
	"filler = random()::text,"
	" filler1 = random()::text WHERE bid = :bid;\n"
	"END;\n"
};

/* -N case */
static char *simple_update = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
	"END;\n"
};

/* -S case */
static char *select_only = {
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
};

/* Function prototypes */
static void setalarm(int seconds);
static void *threadRun(void *arg);


/*
 * routines to check mem allocations and fail noisily.
 */
static void *
xmalloc(size_t size)
{
	void	   *result;

	result = malloc(size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static void *
xrealloc(void *ptr, size_t size)
{
	void	   *result;

	result = realloc(ptr, size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static char *
xstrdup(const char *s)
{
	char	   *result;

	result = strdup(s);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}


static void
usage(void)
{
	printf("%s is a benchmarking tool for PostgreSQL.\n\n"
		   "Usage:\n"
		   "  %s [OPTION]... [DBNAME]\n"
		   "\nInitialization options:\n"
		   "  -i           invokes initialization mode\n"
		   "  -n           do not run VACUUM after initialization\n"
		   "  -F NUM       fill factor\n"
		   "  -s NUM       scaling factor\n"
		   "  --foreign-keys\n"
		   "               create foreign key constraints between tables\n"
		   "  --index-tablespace=TABLESPACE\n"
		   "               create indexes in the specified tablespace\n"
		   "  --tablespace=TABLESPACE\n"
		   "               create tables in the specified tablespace\n"
		   "  --unlogged-tables\n"
		   "               create tables as unlogged tables\n"
		   "\nBenchmarking options:\n"
		"  -c NUM       number of concurrent database clients (default: 1)\n"
		   "  -C           establish new connection for each transaction\n"
		   "  -D VARNAME=VALUE\n"
		   "               define variable for use by custom script\n"
		   "  -f FILENAME  read transaction script from FILENAME\n"
		   "  -j NUM       number of threads (default: 1)\n"
		   "  -l           write transaction times to log file\n"
		   "  -M simple|extended|prepared\n"
		   "               protocol for submitting queries to server (default: simple)\n"
		   "  -n           do not run VACUUM before tests\n"
		   "  -N           do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
		   "  -r           report average latency per command\n"
		   "  -s NUM       report this scale factor in output\n"
		   "  -S           perform SELECT-only transactions\n"
	 "  -t NUM       number of transactions each client runs (default: 10)\n"
		   "  -T NUM       duration of benchmark test in seconds\n"
		   "  -v           vacuum all four standard tables before tests\n"
		   "\nCommon options:\n"
		   "  -d             print debugging output\n"
		   "  -h HOSTNAME    database server host or socket directory\n"
		   "  -p PORT        database server port number\n"
		   "  -U USERNAME    connect as specified database user\n"
		   "  -V, --version  output version information, then exit\n"
		   "  -?, --help     show this help, then exit\n"
		   "\n"
		   "Report bugs to <pgsql-bugs@postgresql.org>.\n",
		   progname, progname);
}

/* random number generator: uniform distribution from min to max inclusive */
static int
getrand(TState *thread, int min, int max)
{
	/*
	 * Odd coding is so that min and max have approximately the same chance of
	 * being selected as do numbers between them.
	 *
	 * pg_erand48() is thread-safe and concurrent, which is why we use it
	 * rather than random(), which in glibc is non-reentrant, and therefore
	 * protected by a mutex, and therefore a bottleneck on machines with many
	 * CPUs.
	 */
	return min + (int) ((max - min + 1) * pg_erand48(thread->random_state));
}

/* call PQexec() and exit() on failure */
static void
executeStatement(PGconn *con, const char *sql)
{
	PGresult   *res;

	res = PQexec(con, sql);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);
}

/* set up a connection to the backend */
static PGconn *
doConnect(void)
{
	PGconn	   *conn;
	static char *password = NULL;
	bool		new_pass;

	/*
	 * Start the connection.  Loop until we have a password if requested by
	 * backend.
	 */
	do
	{
#define PARAMS_ARRAY_SIZE	7

		const char *keywords[PARAMS_ARRAY_SIZE];
		const char *values[PARAMS_ARRAY_SIZE];

		keywords[0] = "host";
		values[0] = pghost;
		keywords[1] = "port";
		values[1] = pgport;
		keywords[2] = "user";
		values[2] = login;
		keywords[3] = "password";
		values[3] = password;
		keywords[4] = "dbname";
		values[4] = dbName;
		keywords[5] = "fallback_application_name";
		values[5] = progname;
		keywords[6] = NULL;
		values[6] = NULL;

		new_pass = false;

		conn = PQconnectdbParams(keywords, values, true);

		if (!conn)
		{
			fprintf(stderr, "Connection to database \"%s\" failed\n",
					dbName);
			return NULL;
		}

		if (PQstatus(conn) == CONNECTION_BAD &&
			PQconnectionNeedsPassword(conn) &&
			password == NULL)
		{
			PQfinish(conn);
			password = simple_prompt("Password: ", 100, false);
			new_pass = true;
		}
	} while (new_pass);

	/* check to see that the backend connection was successfully made */
	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database \"%s\" failed:\n%s",
				dbName, PQerrorMessage(conn));
		PQfinish(conn);
		return NULL;
	}

	return conn;
}

/* throw away response from backend */
static void
discard_response(CState *state)
{
	PGresult   *res;

	do
	{
		res = PQgetResult(state->con);
		if (res)
			PQclear(res);
	} while (res);
}

static int
compareVariables(const void *v1, const void *v2)
{
	return strcmp(((const Variable *) v1)->name,
				  ((const Variable *) v2)->name);
}

static char *
getVariable(CState *st, char *name)
{
	Variable	key,
			   *var;

	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables <= 0)
		return NULL;

	key.name = name;
	var = (Variable *) bsearch((void *) &key,
							   (void *) st->variables,
							   st->nvariables,
							   sizeof(Variable),
							   compareVariables);
	if (var != NULL)
		return var->value;
	else
		return NULL;
}

/* check whether the name consists only of letters, digits and underscores */
static bool
isLegalVariableName(const char *name)
{
	int			i;

	for (i = 0; name[i] != '\0'; i++)
	{
		if (!isalnum((unsigned char) name[i]) && name[i] != '_')
			return false;
	}

	return true;
}

static int
putVariable(CState *st, const char *context, char *name, char *value)
{
	Variable	key,
			   *var;

	key.name = name;
	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables > 0)
		var = (Variable *) bsearch((void *) &key,
								   (void *) st->variables,
								   st->nvariables,
								   sizeof(Variable),
								   compareVariables);
	else
		var = NULL;

	if (var == NULL)
	{
		Variable   *newvars;

		/*
		 * Check for the name only when declaring a new variable to avoid
		 * overhead.
		 */
		if (!isLegalVariableName(name))
		{
			fprintf(stderr, "%s: invalid variable name '%s'\n", context, name);
			return false;
		}

		if (st->variables)
			newvars = (Variable *) xrealloc(st->variables,
									(st->nvariables + 1) * sizeof(Variable));
		else
			newvars = (Variable *) xmalloc(sizeof(Variable));

		st->variables = newvars;

		var = &newvars[st->nvariables];

		var->name = xstrdup(name);
		var->value = xstrdup(value);

		st->nvariables++;

		qsort((void *) st->variables, st->nvariables, sizeof(Variable),
			  compareVariables);
	}
	else
	{
		char	   *val;

		/* dup then free, in case value is pointing at this variable */
		val = xstrdup(value);

		free(var->value);
		var->value = val;
	}

	return true;
}

static char *
parseVariable(const char *sql, int *eaten)
{
	int			i = 0;
	char	   *name;

	do
	{
		i++;
	} while (isalnum((unsigned char) sql[i]) || sql[i] == '_');
	if (i == 1)
		return NULL;

	name = xmalloc(i);
	memcpy(name, &sql[1], i - 1);
	name[i - 1] = '\0';

	*eaten = i;
	return name;
}

static char *
replaceVariable(char **sql, char *param, int len, char *value)
{
	int			valueln = strlen(value);

	if (valueln > len)
	{
		size_t		offset = param - *sql;

		*sql = xrealloc(*sql, strlen(*sql) - len + valueln + 1);
		param = *sql + offset;
	}

	if (valueln != len)
		memmove(param + valueln, param + len, strlen(param + len) + 1);
	strncpy(param, value, valueln);

	return param + valueln;
}

static char *
assignVariables(CState *st, char *sql)
{
	char	   *p,
			   *name,
			   *val;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		val = getVariable(st, name);
		free(name);
		if (val == NULL)
		{
			p++;
			continue;
		}

		p = replaceVariable(&sql, p, eaten, val);
	}

	return sql;
}

static void
getQueryParams(CState *st, const Command *command, const char **params)
{
	int			i;

	for (i = 0; i < command->argc - 1; i++)
		params[i] = getVariable(st, command->argv[i + 1]);
}

/*
 * Run a shell command. The result is assigned to the variable if not NULL.
 * Return true if succeeded, or false on error.
 */
static bool
runShellCommand(CState *st, char *variable, char **argv, int argc)
{
	char		command[SHELL_COMMAND_SIZE];
	int			i,
				len = 0;
	FILE	   *fp;
	char		res[64];
	char	   *endptr;
	int			retval;

	/*----------
	 * Join arguments with whitespace separators. Arguments starting with
	 * exactly one colon are treated as variables:
	 *	name - append a string "name"
	 *	:var - append a variable named 'var'
	 *	::name - append a string ":name"
	 *----------
	 */
	for (i = 0; i < argc; i++)
	{
		char	   *arg;
		int			arglen;

		if (argv[i][0] != ':')
		{
			arg = argv[i];		/* a string literal */
		}
		else if (argv[i][1] == ':')
		{
			arg = argv[i] + 1;	/* a string literal starting with colons */
		}
		else if ((arg = getVariable(st, argv[i] + 1)) == NULL)
		{
			fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[i]);
			return false;
		}

		arglen = strlen(arg);
		if (len + arglen + (i > 0 ? 1 : 0) >= SHELL_COMMAND_SIZE - 1)
		{
			fprintf(stderr, "%s: shell command is too long\n", argv[0]);
			return false;
		}

		if (i > 0)
			command[len++] = ' ';
		memcpy(command + len, arg, arglen);
		len += arglen;
	}

	command[len] = '\0';

	/* Fast path for non-assignment case */
	if (variable == NULL)
	{
		if (system(command))
		{
			if (!timer_exceeded)
				fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
			return false;
		}
		return true;
	}

	/* Execute the command with pipe and read the standard output. */
	if ((fp = popen(command, "r")) == NULL)
	{
		fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
		return false;
	}
	if (fgets(res, sizeof(res), fp) == NULL)
	{
		if (!timer_exceeded)
			fprintf(stderr, "%s: cannot read the result\n", argv[0]);
		(void) pclose(fp);		/* don't leak the pipe on failure */
		return false;
	}
	if (pclose(fp) < 0)
	{
		fprintf(stderr, "%s: cannot close shell command\n", argv[0]);
		return false;
	}

	/* Check whether the result is an integer and assign it to the variable */
	retval = (int) strtol(res, &endptr, 10);
	while (*endptr != '\0' && isspace((unsigned char) *endptr))
		endptr++;
	if (*res == '\0' || *endptr != '\0')
	{
		fprintf(stderr, "%s: must return an integer ('%s' returned)\n", argv[0], res);
		return false;
	}
	snprintf(res, sizeof(res), "%d", retval);
	if (!putVariable(st, "setshell", variable, res))
		return false;

#ifdef DEBUG
	printf("shell parameter name: %s, value: %s\n", argv[1], res);
#endif
	return true;
}

#define MAX_PREPARE_NAME		32
static void
preparedStatementName(char *buffer, int file, int state)
{
	sprintf(buffer, "P%d_%d", file, state);
}

static bool
clientDone(CState *st, bool ok)
{
	(void) ok;					/* unused */

	if (st->con != NULL)
	{
		PQfinish(st->con);
		st->con = NULL;
	}
	return false;				/* always false */
}

/* return false iff client should be disconnected */
static bool
doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile)
{
	PGresult   *res;
	Command   **commands;

top:
	commands = sql_files[st->use_file];

	if (st->sleeping)
	{							/* are we sleeping? */
		instr_time	now;

		INSTR_TIME_SET_CURRENT(now);
		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
		else
			return true;		/* Still sleeping, nothing to do here */
	}

	if (st->listen)
	{							/* are we waiting for a command to finish? */
		if (commands[st->state]->type == SQL_COMMAND)
		{
			if (debug)
				fprintf(stderr, "client %d receiving\n", st->id);
			if (!PQconsumeInput(st->con))
			{					/* there's something wrong */
				fprintf(stderr, "Client %d aborted in state %d. Probably the backend died while processing.\n", st->id, st->state);
				return clientDone(st, false);
			}
			if (PQisBusy(st->con))
				return true;	/* don't have the whole result yet */
		}

		/*
		 * command finished: accumulate per-command execution times in
		 * thread-local data structure, if per-command latencies are requested
		 */
		if (is_latencies)
		{
			instr_time	now;
			int			cnum = commands[st->state]->command_num;

			INSTR_TIME_SET_CURRENT(now);
			INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
								  now, st->stmt_begin);
			thread->exec_count[cnum]++;
		}

		/*
		 * if transaction finished, record the time it took in the log
		 */
		if (logfile && commands[st->state + 1] == NULL)
		{
			instr_time	now;
			instr_time	diff;
			double		usec;

			INSTR_TIME_SET_CURRENT(now);
			diff = now;
			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
			usec = (double) INSTR_TIME_GET_MICROSEC(diff);

#ifndef WIN32
			/* This is more than we really ought to know about instr_time */
			fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
					st->id, st->cnt, usec, st->use_file,
					(long) now.tv_sec, (long) now.tv_usec);
#else
			/* On Windows, instr_time doesn't provide a timestamp anyway */
			fprintf(logfile, "%d %d %.0f %d 0 0\n",
					st->id, st->cnt, usec, st->use_file);
#endif
		}

		if (commands[st->state]->type == SQL_COMMAND)
		{
			/*
			 * Read and discard the query result; note this is not included in
			 * the statement latency numbers.
			 */
			res = PQgetResult(st->con);
			switch (PQresultStatus(res))
			{
				case PGRES_COMMAND_OK:
				case PGRES_TUPLES_OK:
					break;		/* OK */
				default:
					fprintf(stderr, "Client %d aborted in state %d: %s",
							st->id, st->state, PQerrorMessage(st->con));
					PQclear(res);
					return clientDone(st, false);
			}
			PQclear(res);
			discard_response(st);
		}

		if (commands[st->state + 1] == NULL)
		{
			if (is_connect)
			{
				PQfinish(st->con);
				st->con = NULL;
			}

			++st->cnt;
			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
				return clientDone(st, true);	/* exit success */
		}

		/* increment state counter */
		st->state++;
		if (commands[st->state] == NULL)
		{
			st->state = 0;
			st->use_file = getrand(thread, 0, num_files - 1);
			commands = sql_files[st->use_file];
		}
	}

	if (st->con == NULL)
	{
		instr_time	start,
					end;

		INSTR_TIME_SET_CURRENT(start);
		if ((st->con = doConnect()) == NULL)
		{
			fprintf(stderr, "Client %d aborted in establishing connection.\n", st->id);
			return clientDone(st, false);
		}
		INSTR_TIME_SET_CURRENT(end);
		INSTR_TIME_ACCUM_DIFF(*conn_time, end, start);
	}

	/* Record transaction start time if logging is enabled */
	if (logfile && st->state == 0)
		INSTR_TIME_SET_CURRENT(st->txn_begin);

	/* Record statement start time if per-command latencies are requested */
	if (is_latencies)
		INSTR_TIME_SET_CURRENT(st->stmt_begin);

	if (commands[st->state]->type == SQL_COMMAND)
	{
		const Command *command = commands[st->state];
		int			r;

		if (querymode == QUERY_SIMPLE)
		{
			char	   *sql;

			sql = xstrdup(command->argv[0]);
			sql = assignVariables(st, sql);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQuery(st->con, sql);
			free(sql);
		}
		else if (querymode == QUERY_EXTENDED)
		{
			const char *sql = command->argv[0];
			const char *params[MAX_ARGS];

			getQueryParams(st, command, params);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQueryParams(st->con, sql, command->argc - 1,
								  NULL, params, NULL, NULL, 0);
		}
		else if (querymode == QUERY_PREPARED)
		{
			char		name[MAX_PREPARE_NAME];
			const char *params[MAX_ARGS];

			if (!st->prepared[st->use_file])
			{
				int			j;

				for (j = 0; commands[j] != NULL; j++)
				{
					PGresult   *res;
					char		name[MAX_PREPARE_NAME];

					if (commands[j]->type != SQL_COMMAND)
						continue;
					preparedStatementName(name, st->use_file, j);
					res = PQprepare(st->con, name,
						  commands[j]->argv[0], commands[j]->argc - 1, NULL);
					if (PQresultStatus(res) != PGRES_COMMAND_OK)
						fprintf(stderr, "%s", PQerrorMessage(st->con));
					PQclear(res);
				}
				st->prepared[st->use_file] = true;
			}

			getQueryParams(st, command, params);
			preparedStatementName(name, st->use_file, st->state);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, name);
			r = PQsendQueryPrepared(st->con, name, command->argc - 1,
									params, NULL, NULL, 0);
		}
		else	/* unknown sql mode */
			r = 0;

		if (r == 0)
		{
			if (debug)
				fprintf(stderr, "client %d cannot send %s\n", st->id, command->argv[0]);
			st->ecnt++;
		}
		else
			st->listen = 1;		/* flag that we should now wait for the result */
	}
	else if (commands[st->state]->type == META_COMMAND)
	{
		int			argc = commands[st->state]->argc,
					i;
		char	  **argv = commands[st->state]->argv;

		if (debug)
		{
			fprintf(stderr, "client %d executing \\%s", st->id, argv[0]);
			for (i = 1; i < argc; i++)
				fprintf(stderr, " %s", argv[i]);
			fprintf(stderr, "\n");
		}

		if (pg_strcasecmp(argv[0], "setrandom") == 0)
		{
			char	   *var;
			int			min,
						max;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				min = atoi(var);
			}
			else
				min = atoi(argv[2]);

#ifdef NOT_USED
			if (min < 0)
			{
				fprintf(stderr, "%s: invalid minimum number %d\n", argv[0], min);
				st->ecnt++;
				return true;
			}
#endif

			if (*argv[3] == ':')
			{
				if ((var = getVariable(st, argv[3] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
				max = atoi(var);
			}
			else
				max = atoi(argv[3]);

			if (max < min)
			{
				fprintf(stderr, "%s: maximum is less than minimum\n", argv[0]);
				st->ecnt++;
				return true;
			}

			/*
			 * getrand() needs to be able to subtract min from max and add one
			 * to the result without overflowing.  Since we know max >= min, we
			 * can detect overflow just by checking for a negative result. But
			 * we must check both that the subtraction doesn't overflow, and
			 * that adding one to the result doesn't overflow either.
			 */
			if (max - min < 0 || (max - min) + 1 < 0)
			{
				fprintf(stderr, "%s: range too large\n", argv[0]);
				st->ecnt++;
				return true;
			}

#ifdef DEBUG
			printf("min: %d max: %d random: %d\n", min, max, getrand(thread, min, max));
#endif
			snprintf(res, sizeof(res), "%d", getrand(thread, min, max));

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "set") == 0)
		{
			char	   *var;
			int			ope1,
						ope2;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				ope1 = atoi(var);
			}
			else
				ope1 = atoi(argv[2]);

			if (argc < 5)
				snprintf(res, sizeof(res), "%d", ope1);
			else
			{
				if (*argv[4] == ':')
				{
					if ((var = getVariable(st, argv[4] + 1)) == NULL)
					{
						fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[4]);
						st->ecnt++;
						return true;
					}
					ope2 = atoi(var);
				}
				else
					ope2 = atoi(argv[4]);

				if (strcmp(argv[3], "+") == 0)
					snprintf(res, sizeof(res), "%d", ope1 + ope2);
				else if (strcmp(argv[3], "-") == 0)
					snprintf(res, sizeof(res), "%d", ope1 - ope2);
				else if (strcmp(argv[3], "*") == 0)
					snprintf(res, sizeof(res), "%d", ope1 * ope2);
				else if (strcmp(argv[3], "/") == 0)
				{
					if (ope2 == 0)
					{
						fprintf(stderr, "%s: division by zero\n", argv[0]);
						st->ecnt++;
						return true;
					}
					snprintf(res, sizeof(res), "%d", ope1 / ope2);
				}
				else
				{
					fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
			}

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "sleep") == 0)
		{
			char	   *var;
			int			usec;
			instr_time	now;

			if (*argv[1] == ':')
			{
				if ((var = getVariable(st, argv[1] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[1]);
					st->ecnt++;
					return true;
				}
				usec = atoi(var);
			}
			else
				usec = atoi(argv[1]);

			if (argc > 2)
			{
				if (pg_strcasecmp(argv[2], "ms") == 0)
					usec *= 1000;
				else if (pg_strcasecmp(argv[2], "s") == 0)
					usec *= 1000000;
			}
			else
				usec *= 1000000;

			INSTR_TIME_SET_CURRENT(now);
			st->until = INSTR_TIME_GET_MICROSEC(now) + usec;
			st->sleeping = 1;

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "setshell") == 0)
		{
			bool		ret = runShellCommand(st, argv[1], argv + 2, argc - 2);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "shell") == 0)
		{
			bool		ret = runShellCommand(st, NULL, argv + 1, argc - 1);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		goto top;
	}

	return true;
}

/* discard connections */
static void
disconnect_all(CState *state, int length)
{
	int			i;

	for (i = 0; i < length; i++)
	{
		if (state[i].con)
		{
			PQfinish(state[i].con);
			state[i].con = NULL;
		}
	}
}

/* create tables and setup data */
static void
init(bool is_no_vacuum)
{
	/*
	 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
	 * fields in these table declarations were intended to comply with that.
	 * But because they default to NULLs, they don't actually take any space.
	 * We could fix that by giving them non-null default values. However, that
	 * would completely break comparability of pgbench results with prior
	 * versions.  Since pgbench has never pretended to be fully TPC-B
	 * compliant anyway, we stick with the historical behavior.
	 */
	struct ddlinfo
	{
		char	   *table;
		char	   *cols;
		int			declare_fillfactor;
	};
	struct ddlinfo DDLs[] = {
		{
			"pgbench_history",
			"tid int,bid int,aid int,delta int,mtime timestamp,filler char(22)",
			0
		},
		{
			"pgbench_tellers",
			"tid int not null,bid int,tbalance int,filler char(92),"
			"tbalance1 int, filler1 varchar(150),tbalance2 int,filler2 char(1550)",
			1
		},
		{
			"pgbench_accounts",
			"aid int not null,bid int,abalance int,filler char(92),"
			"abalance1 int,filler1 varchar(150),abalance2 int,filler2 char(1550)",
			1
		},
		{
			"pgbench_branches",
			"bid int not null,bbalance int,filler char(92),bbalance1 int,"
			"filler1 varchar(150), bbalance2 int, filler2 char(1550)",
			1
		}
	};
	static char *DDLAFTERs[] = {
		"alter table pgbench_branches add primary key (bid)",
		"alter table pgbench_tellers add primary key (tid)",
		"alter table pgbench_accounts add primary key (aid)"
	};
	static char *DDLKEYs[] = {
		"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
		"alter table pgbench_accounts add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (tid) references pgbench_tellers",
		"alter table pgbench_history add foreign key (aid) references pgbench_accounts"
	};

	PGconn	   *con;
	PGresult   *res;
	char		sql[256];
	int			i;

	if ((con = doConnect()) == NULL)
		exit(1);

	for (i = 0; i < lengthof(DDLs); i++)
	{
		char		opts[256];
		char		buffer[256];
		struct ddlinfo *ddl = &DDLs[i];

		/* Remove old table, if it exists. */
		snprintf(buffer, 256, "drop table if exists %s", ddl->table);
		executeStatement(con, buffer);

		/* Construct new create table statement. */
		opts[0] = '\0';
		if (ddl->declare_fillfactor)
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " with (fillfactor=%d)", fillfactor);
		if (tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, tablespace,
												   strlen(tablespace));
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}
		snprintf(buffer, 256, "create%s table %s(%s)%s",
				 unlogged_tables ? " unlogged" : "",
				 ddl->table, ddl->cols, opts);

		executeStatement(con, buffer);
	}

	executeStatement(con, "begin");

	for (i = 0; i < nbranches * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_branches values(%d,0,0,0,0,0,0)", i + 1);
		executeStatement(con, sql);
	}

	for (i = 0; i < ntellers * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_tellers values (%d,%d,0,0,0,0,0,0)",
				 i + 1, i / ntellers + 1);
		executeStatement(con, sql);
	}

	executeStatement(con, "commit");

	/*
	 * fill the pgbench_accounts table with some data
	 */
	fprintf(stderr, "creating tables...\n");

	executeStatement(con, "begin");
	executeStatement(con, "truncate pgbench_accounts");

	res = PQexec(con, "copy pgbench_accounts from stdin");
	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);

	for (i = 0; i < naccounts * scale; i++)
	{
		int			j = i + 1;

		snprintf(sql, 256, "%d\t%d\t%d\t \t%d\t \t%d\t \n", j, i / naccounts + 1, 0, 0, 0);
		if (PQputline(con, sql))
		{
			fprintf(stderr, "PQputline failed\n");
			exit(1);
		}

		if (j % 100000 == 0)
			fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
					j, naccounts * scale,
					j * 100 / (naccounts * scale));
	}
	if (PQputline(con, "\\.\n"))
	{
		fprintf(stderr, "very last PQputline failed\n");
		exit(1);
	}
	if (PQendcopy(con))
	{
		fprintf(stderr, "PQendcopy failed\n");
		exit(1);
	}
	executeStatement(con, "commit");

	/* vacuum */
	if (!is_no_vacuum)
	{
		fprintf(stderr, "vacuum...\n");
		executeStatement(con, "vacuum analyze pgbench_branches");
		executeStatement(con, "vacuum analyze pgbench_tellers");
		executeStatement(con, "vacuum analyze pgbench_accounts");
		executeStatement(con, "vacuum analyze pgbench_history");
	}

	/*
	 * create indexes
	 */
	fprintf(stderr, "set primary keys...\n");
	for (i = 0; i < lengthof(DDLAFTERs); i++)
	{
		char		buffer[256];

		strncpy(buffer, DDLAFTERs[i], 256);

		if (index_tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, index_tablespace,
												   strlen(index_tablespace));
			snprintf(buffer + strlen(buffer), 256 - strlen(buffer),
					 " using index tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}

		executeStatement(con, buffer);
	}

	/*
	 * create foreign keys
	 */
	if (foreign_keys)
	{
		fprintf(stderr, "set foreign keys...\n");
		for (i = 0; i < lengthof(DDLKEYs); i++)
		{
			executeStatement(con, DDLKEYs[i]);
		}
	}


	fprintf(stderr, "done.\n");
	PQfinish(con);
}

/*
 * Parse the raw SQL and replace each :param with $n.
 */
static bool
parseQuery(Command *cmd, const char *raw_sql)
{
	char	   *sql,
			   *p;

	sql = xstrdup(raw_sql);
	cmd->argc = 1;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		char		var[12];
		char	   *name;
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		if (cmd->argc >= MAX_ARGS)
		{
			fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n", MAX_ARGS - 1, raw_sql);
			return false;
		}

		sprintf(var, "$%d", cmd->argc);
		p = replaceVariable(&sql, p, eaten, var);

		cmd->argv[cmd->argc] = name;
		cmd->argc++;
	}

	cmd->argv[0] = sql;
	return true;
}

/* Parse a command; return a Command struct, or NULL if it's a comment */
static Command *
process_commands(char *buf)
{
	const char	delim[] = " \f\n\r\t\v";

	Command    *my_commands;
	int			j;
	char	   *p,
			   *tok;

	/* Make the string buf end at the next newline */
	if ((p = strchr(buf, '\n')) != NULL)
		*p = '\0';

	/* Skip leading whitespace */
	p = buf;
	while (isspace((unsigned char) *p))
		p++;

	/* If the line is empty or actually a comment, we're done */
	if (*p == '\0' || strncmp(p, "--", 2) == 0)
		return NULL;

	/* Allocate and initialize Command structure */
	my_commands = (Command *) xmalloc(sizeof(Command));
	my_commands->line = xstrdup(buf);
	my_commands->command_num = num_commands++;
	my_commands->type = 0;		/* until set */
	my_commands->argc = 0;

	if (*p == '\\')
	{
		my_commands->type = META_COMMAND;

		j = 0;
		tok = strtok(++p, delim);

		while (tok != NULL)
		{
			my_commands->argv[j++] = xstrdup(tok);
			my_commands->argc++;
			tok = strtok(NULL, delim);
		}

		if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
		{
			if (my_commands->argc < 4)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = 4; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = my_commands->argc < 5 ? 3 : 5; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "sleep") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			/*
			 * Split argument into number and unit to allow "sleep 1ms" etc.
			 * We don't have to terminate the number argument with null
			 * because it will be parsed with atoi, which ignores trailing
			 * non-digit characters.
			 */
			if (my_commands->argv[1][0] != ':')
			{
				char	   *c = my_commands->argv[1];

				while (isdigit((unsigned char) *c))
					c++;
				if (*c)
				{
					my_commands->argv[2] = c;
					if (my_commands->argc < 3)
						my_commands->argc = 3;
				}
			}

			if (my_commands->argc >= 3)
			{
				if (pg_strcasecmp(my_commands->argv[2], "us") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "ms") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "s") != 0)
				{
					fprintf(stderr, "%s: unknown time unit '%s' - must be us, ms or s\n",
							my_commands->argv[0], my_commands->argv[2]);
					exit(1);
				}
			}

			for (j = 3; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "setshell") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else if (pg_strcasecmp(my_commands->argv[0], "shell") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing command\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else
		{
			fprintf(stderr, "Invalid command %s\n", my_commands->argv[0]);
			exit(1);
		}
	}
	else
	{
		my_commands->type = SQL_COMMAND;

		switch (querymode)
		{
			case QUERY_SIMPLE:
				my_commands->argv[0] = xstrdup(p);
				my_commands->argc++;
				break;
			case QUERY_EXTENDED:
			case QUERY_PREPARED:
				if (!parseQuery(my_commands, p))
					exit(1);
				break;
			default:
				exit(1);
		}
	}

	return my_commands;
}

static int
process_file(char *filename)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	FILE	   *fd;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	if (num_files >= MAX_FILES)
	{
		fprintf(stderr, "At most %d SQL files are allowed\n", MAX_FILES);
		exit(1);
	}

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	if (strcmp(filename, "-") == 0)
		fd = stdin;
	else if ((fd = fopen(filename, "r")) == NULL)
	{
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		return false;
	}

	lineno = 0;

	while (fgets(buf, sizeof(buf), fd) != NULL)
	{
		Command    *command;

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}
	fclose(fd);

	my_commands[lineno] = NULL;

	sql_files[num_files++] = my_commands;

	return true;
}

static Command **
process_builtin(char *tb)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	lineno = 0;

	for (;;)
	{
		char	   *p;
		Command    *command;

		p = buf;
		while (*tb && *tb != '\n')
			*p++ = *tb++;

		if (*tb == '\0')
			break;

		if (*tb == '\n')
			tb++;

		*p = '\0';

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}

	my_commands[lineno] = NULL;

	return my_commands;
}

/* print out results */
static void
printResults(int ttype, int normal_xacts, int nclients,
			 TState *threads, int nthreads,
			 instr_time total_time, instr_time conn_total_time)
{
	double		time_include,
				tps_include,
				tps_exclude;
	char	   *s;

	time_include = INSTR_TIME_GET_DOUBLE(total_time);
	tps_include = normal_xacts / time_include;
	tps_exclude = normal_xacts / (time_include -
						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));

	if (ttype == 0)
		s = "TPC-B (sort of)";
	else if (ttype == 2)
		s = "Update only pgbench_accounts";
	else if (ttype == 1)
		s = "SELECT only";
	else
		s = "Custom query";

	printf("transaction type: %s\n", s);
	printf("scaling factor: %d\n", scale);
	printf("query mode: %s\n", QUERYMODE[querymode]);
	printf("number of clients: %d\n", nclients);
	printf("number of threads: %d\n", nthreads);
	if (duration <= 0)
	{
		printf("number of transactions per client: %d\n", nxacts);
		printf("number of transactions actually processed: %d/%d\n",
			   normal_xacts, nxacts * nclients);
	}
	else
	{
		printf("duration: %d s\n", duration);
		printf("number of transactions actually processed: %d\n",
			   normal_xacts);
	}
	printf("tps = %f (including connections establishing)\n", tps_include);
	printf("tps = %f (excluding connections establishing)\n", tps_exclude);

	/* Report per-command latencies */
	if (is_latencies)
	{
		int			i;

		for (i = 0; i < num_files; i++)
		{
			Command   **commands;

			if (num_files > 1)
				printf("statement latencies in milliseconds, file %d:\n", i + 1);
			else
				printf("statement latencies in milliseconds:\n");

			for (commands = sql_files[i]; *commands != NULL; commands++)
			{
				Command    *command = *commands;
				int			cnum = command->command_num;
				double		total_time;
				instr_time	total_exec_elapsed;
				int			total_exec_count;
				int			t;

				/* Accumulate per-thread data for command */
				INSTR_TIME_SET_ZERO(total_exec_elapsed);
				total_exec_count = 0;
				for (t = 0; t < nthreads; t++)
				{
					TState	   *thread = &threads[t];

					INSTR_TIME_ADD(total_exec_elapsed,
								   thread->exec_elapsed[cnum]);
					total_exec_count += thread->exec_count[cnum];
				}

				if (total_exec_count > 0)
					total_time = INSTR_TIME_GET_MILLISEC(total_exec_elapsed) / (double) total_exec_count;
				else
					total_time = 0.0;

				printf("\t%f\t%s\n", total_time, command->line);
			}
		}
	}
}


int
main(int argc, char **argv)
{
	int			c;
	int			nclients = 1;	/* default number of simulated clients */
	int			nthreads = 1;	/* default number of threads */
	int			is_init_mode = 0;		/* initialize mode? */
	int			is_no_vacuum = 0;		/* no vacuum at all before testing? */
	int			do_vacuum_accounts = 0; /* do vacuum accounts before testing? */
	int			ttype = 0;		/* transaction type. 0: TPC-B, 1: SELECT only,
								 * 2: skip update of branches and tellers */
	int			optindex;
	char	   *filename = NULL;
	bool		scale_given = false;

	CState	   *state;			/* status of clients */
	TState	   *threads;		/* array of threads */

	instr_time	start_time;		/* start up time */
	instr_time	total_time;
	instr_time	conn_total_time;
	int			total_xacts;

	int			i;

	static struct option long_options[] = {
		{"foreign-keys", no_argument, &foreign_keys, 1},
		{"index-tablespace", required_argument, NULL, 3},
		{"tablespace", required_argument, NULL, 2},
		{"unlogged-tables", no_argument, &unlogged_tables, 1},
		{NULL, 0, NULL, 0}
	};

#ifdef HAVE_GETRLIMIT
	struct rlimit rlim;
#endif

	PGconn	   *con;
	PGresult   *res;
	char	   *env;

	char		val[64];

	progname = get_progname(argv[0]);

	if (argc > 1)
	{
		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
		{
			usage();
			exit(0);
		}
		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
		{
			puts("pgbench (PostgreSQL) " PG_VERSION);
			exit(0);
		}
	}

#ifdef WIN32
	/* stderr is buffered on Win32. */
	setvbuf(stderr, NULL, _IONBF, 0);
#endif

	if ((env = getenv("PGHOST")) != NULL && *env != '\0')
		pghost = env;
	if ((env = getenv("PGPORT")) != NULL && *env != '\0')
		pgport = env;
	if ((env = getenv("PGUSER")) != NULL && *env != '\0')
		login = env;

	state = (CState *) xmalloc(sizeof(CState));
	memset(state, 0, sizeof(CState));

	while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
	{
		switch (c)
		{
			case 'i':
				is_init_mode++;
				break;
			case 'h':
				pghost = optarg;
				break;
			case 'n':
				is_no_vacuum++;
				break;
			case 'v':
				do_vacuum_accounts++;
				break;
			case 'p':
				pgport = optarg;
				break;
			case 'd':
				debug++;
				break;
			case 'S':
				ttype = 1;
				break;
			case 'N':
				ttype = 2;
				break;
			case 'c':
				nclients = atoi(optarg);
				if (nclients <= 0 || nclients > MAXCLIENTS)
				{
					fprintf(stderr, "invalid number of clients: %d\n", nclients);
					exit(1);
				}
#ifdef HAVE_GETRLIMIT
#ifdef RLIMIT_NOFILE			/* most platforms use RLIMIT_NOFILE */
				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
#else							/* but BSD doesn't ... */
				if (getrlimit(RLIMIT_OFILE, &rlim) == -1)
#endif   /* RLIMIT_NOFILE */
				{
					fprintf(stderr, "getrlimit failed: %s\n", strerror(errno));
					exit(1);
				}
				if (rlim.rlim_cur <= (nclients + 2))
				{
					fprintf(stderr, "You need at least %d open files but you are only allowed to use %ld.\n", nclients + 2, (long) rlim.rlim_cur);
					fprintf(stderr, "Use limit/ulimit to increase the limit before using pgbench.\n");
					exit(1);
				}
#endif   /* HAVE_GETRLIMIT */
				break;
			case 'j':			/* jobs */
				nthreads = atoi(optarg);
				if (nthreads <= 0)
				{
					fprintf(stderr, "invalid number of threads: %d\n", nthreads);
					exit(1);
				}
				break;
			case 'C':
				is_connect = true;
				break;
			case 'r':
				is_latencies = true;
				break;
			case 's':
				scale_given = true;
				scale = atoi(optarg);
				if (scale <= 0)
				{
					fprintf(stderr, "invalid scaling factor: %d\n", scale);
					exit(1);
				}
				break;
			case 't':
				if (duration > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				nxacts = atoi(optarg);
				if (nxacts <= 0)
				{
					fprintf(stderr, "invalid number of transactions: %d\n", nxacts);
					exit(1);
				}
				break;
			case 'T':
				if (nxacts > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				duration = atoi(optarg);
				if (duration <= 0)
				{
					fprintf(stderr, "invalid duration: %d\n", duration);
					exit(1);
				}
				break;
			case 'U':
				login = optarg;
				break;
			case 'l':
				use_log = true;
				break;
			case 'f':
				ttype = 3;
				filename = optarg;
				if (process_file(filename) == false || *sql_files[num_files - 1] == NULL)
					exit(1);
				break;
			case 'D':
				{
					char	   *p;

					if ((p = strchr(optarg, '=')) == NULL || p == optarg || *(p + 1) == '\0')
					{
						fprintf(stderr, "invalid variable definition: %s\n", optarg);
						exit(1);
					}

					*p++ = '\0';
					if (!putVariable(&state[0], "option", optarg, p))
						exit(1);
				}
				break;
			case 'F':
				fillfactor = atoi(optarg);
				if ((fillfactor < 10) || (fillfactor > 100))
				{
					fprintf(stderr, "invalid fillfactor: %d\n", fillfactor);
					exit(1);
				}
				break;
			case 'M':
				if (num_files > 0)
				{
					fprintf(stderr, "query mode (-M) should be specified before transaction scripts (-f)\n");
					exit(1);
				}
				for (querymode = 0; querymode < NUM_QUERYMODE; querymode++)
					if (strcmp(optarg, QUERYMODE[querymode]) == 0)
						break;
				if (querymode >= NUM_QUERYMODE)
				{
					fprintf(stderr, "invalid query mode (-M): %s\n", optarg);
					exit(1);
				}
				break;
			case 0:
				/* This covers long options which take no argument. */
				break;
			case 2:				/* tablespace */
				tablespace = optarg;
				break;
			case 3:				/* index-tablespace */
				index_tablespace = optarg;
				break;
			default:
				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
				exit(1);
				break;
		}
	}

	if (argc > optind)
		dbName = argv[optind];
	else
	{
		if ((env = getenv("PGDATABASE")) != NULL && *env != '\0')
			dbName = env;
		else if (login != NULL && *login != '\0')
			dbName = login;
		else
			dbName = "";
	}

	if (is_init_mode)
	{
		init(is_no_vacuum);
		exit(0);
	}

	/* Use DEFAULT_NXACTS if neither nxacts nor duration is specified. */
	if (nxacts <= 0 && duration <= 0)
		nxacts = DEFAULT_NXACTS;

	if (nclients % nthreads != 0)
	{
		fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n", nclients, nthreads);
		exit(1);
	}

	/*
	 * is_latencies only works with multiple threads in thread-based
	 * implementations, not fork-based ones, because it supposes that the
	 * parent can see changes made to the per-thread execution stats by child
	 * threads.  It seems useful enough to accept despite this limitation, but
	 * perhaps we should FIXME someday (by passing the stats data back up
	 * through the parent-to-child pipes).
	 */
#ifndef ENABLE_THREAD_SAFETY
	if (is_latencies && nthreads > 1)
	{
		fprintf(stderr, "-r does not work with -j larger than 1 on this platform.\n");
		exit(1);
	}
#endif

	/*
	 * save main process id in the global variable because process id will be
	 * changed after fork.
	 */
	main_pid = (int) getpid();

	if (nclients > 1)
	{
		state = (CState *) xrealloc(state, sizeof(CState) * nclients);
		memset(state + 1, 0, sizeof(CState) * (nclients - 1));

		/* copy any -D switch values to all clients */
		for (i = 1; i < nclients; i++)
		{
			int			j;

			state[i].id = i;
			for (j = 0; j < state[0].nvariables; j++)
			{
				if (!putVariable(&state[i], "startup", state[0].variables[j].name, state[0].variables[j].value))
					exit(1);
			}
		}
	}

	if (debug)
	{
		if (duration <= 0)
			printf("pghost: %s pgport: %s nclients: %d nxacts: %d dbName: %s\n",
				   pghost, pgport, nclients, nxacts, dbName);
		else
			printf("pghost: %s pgport: %s nclients: %d duration: %d dbName: %s\n",
				   pghost, pgport, nclients, duration, dbName);
	}

	/* opening connection... */
	con = doConnect();
	if (con == NULL)
		exit(1);

	if (PQstatus(con) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database '%s' failed.\n", dbName);
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}

	if (ttype != 3)
	{
		/*
		 * get the scaling factor, which should be the same as count(*) from
		 * pgbench_branches, if this is not a custom query
		 */
		res = PQexec(con, "select count(*) from pgbench_branches");
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(con));
			exit(1);
		}
		scale = atoi(PQgetvalue(res, 0, 0));
		if (scale < 0)
		{
			fprintf(stderr, "count(*) from pgbench_branches invalid (%d)\n", scale);
			exit(1);
		}
		PQclear(res);

		/* warn if we override user-given -s switch */
		if (scale_given)
			fprintf(stderr,
			"Scale option ignored, using pgbench_branches table count = %d\n",
					scale);
	}

	/*
	 * :scale variables normally get -s or database scale, but don't override
	 * an explicit -D switch
	 */
	if (getVariable(&state[0], "scale") == NULL)
	{
		snprintf(val, sizeof(val), "%d", scale);
		for (i = 0; i < nclients; i++)
		{
			if (!putVariable(&state[i], "startup", "scale", val))
				exit(1);
		}
	}

	if (!is_no_vacuum)
	{
		fprintf(stderr, "starting vacuum...");
		executeStatement(con, "vacuum pgbench_branches");
		executeStatement(con, "vacuum pgbench_tellers");
		executeStatement(con, "truncate pgbench_history");
		fprintf(stderr, "end.\n");

		if (do_vacuum_accounts)
		{
			fprintf(stderr, "starting vacuum pgbench_accounts...");
			executeStatement(con, "vacuum analyze pgbench_accounts");
			fprintf(stderr, "end.\n");
		}
	}
	PQfinish(con);

	/* set random seed */
	INSTR_TIME_SET_CURRENT(start_time);
	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));

	/* process builtin SQL scripts */
	switch (ttype)
	{
		case 0:
			sql_files[0] = process_builtin(tpc_b);
			num_files = 1;
			break;

		case 1:
			sql_files[0] = process_builtin(select_only);
			num_files = 1;
			break;

		case 2:
			sql_files[0] = process_builtin(simple_update);
			num_files = 1;
			break;

		default:
			break;
	}

	/* set up thread data structures */
	threads = (TState *) xmalloc(sizeof(TState) * nthreads);
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		thread->tid = i;
		thread->state = &state[nclients / nthreads * i];
		thread->nstate = nclients / nthreads;
		thread->random_state[0] = random();
		thread->random_state[1] = random();
		thread->random_state[2] = random();

		if (is_latencies)
		{
			/* Reserve memory for the thread to store per-command latencies */
			int			t;

			thread->exec_elapsed = (instr_time *)
				xmalloc(sizeof(instr_time) * num_commands);
			thread->exec_count = (int *)
				xmalloc(sizeof(int) * num_commands);

			for (t = 0; t < num_commands; t++)
			{
				INSTR_TIME_SET_ZERO(thread->exec_elapsed[t]);
				thread->exec_count[t] = 0;
			}
		}
		else
		{
			thread->exec_elapsed = NULL;
			thread->exec_count = NULL;
		}
	}

	/* get start up time */
	INSTR_TIME_SET_CURRENT(start_time);

	/* set alarm if duration is specified. */
	if (duration > 0)
		setalarm(duration);

	/* start threads */
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		INSTR_TIME_SET_CURRENT(thread->start_time);

		/* the first thread (i = 0) is executed by main thread */
		if (i > 0)
		{
			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);

			if (err != 0 || thread->thread == INVALID_THREAD)
			{
				fprintf(stderr, "cannot create thread: %s\n", strerror(err));
				exit(1);
			}
		}
		else
		{
			thread->thread = INVALID_THREAD;
		}
	}

	/* wait for threads and accumulate results */
	total_xacts = 0;
	INSTR_TIME_SET_ZERO(conn_total_time);
	for (i = 0; i < nthreads; i++)
	{
		void	   *ret = NULL;

		if (threads[i].thread == INVALID_THREAD)
			ret = threadRun(&threads[i]);
		else
			pthread_join(threads[i].thread, &ret);

		if (ret != NULL)
		{
			TResult    *r = (TResult *) ret;

			total_xacts += r->xacts;
			INSTR_TIME_ADD(conn_total_time, r->conn_time);
			free(ret);
		}
	}
	disconnect_all(state, nclients);

	/* get end time */
	INSTR_TIME_SET_CURRENT(total_time);
	INSTR_TIME_SUBTRACT(total_time, start_time);
	printResults(ttype, total_xacts, nclients, threads, nthreads,
				 total_time, conn_total_time);

	return 0;
}

static void *
threadRun(void *arg)
{
	TState	   *thread = (TState *) arg;
	CState	   *state = thread->state;
	TResult    *result;
	FILE	   *logfile = NULL; /* per-thread log file */
	instr_time	start,
				end;
	int			nstate = thread->nstate;
	int			remains = nstate;		/* number of remaining clients */
	int			i;

	result = xmalloc(sizeof(TResult));
	INSTR_TIME_SET_ZERO(result->conn_time);

	/* open log file if requested */
	if (use_log)
	{
		char		logpath[64];

		if (thread->tid == 0)
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d", main_pid);
		else
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d.%d", main_pid, thread->tid);
		logfile = fopen(logpath, "w");

		if (logfile == NULL)
		{
			fprintf(stderr, "Couldn't open logfile \"%s\": %s\n", logpath, strerror(errno));
			goto done;
		}
	}

	if (!is_connect)
	{
		/* make connections to the database */
		for (i = 0; i < nstate; i++)
		{
			if ((state[i].con = doConnect()) == NULL)
				goto done;
		}
	}

	/* time after thread and connections set up */
	INSTR_TIME_SET_CURRENT(result->conn_time);
	INSTR_TIME_SUBTRACT(result->conn_time, thread->start_time);

	/* send start up queries in async manner */
	for (i = 0; i < nstate; i++)
	{
		CState	   *st = &state[i];
		Command   **commands;
		int			prev_ecnt = st->ecnt;

		st->use_file = getrand(thread, 0, num_files - 1);
		commands = sql_files[st->use_file];		/* look up after choosing the script */
		if (!doCustom(thread, st, &result->conn_time, logfile))
			remains--;			/* I've aborted */

		if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
		{
			fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
			remains--;			/* I've aborted */
			PQfinish(st->con);
			st->con = NULL;
		}
	}

	while (remains > 0)
	{
		fd_set		input_mask;
		int			maxsock;	/* max socket number to be waited */
		int64		now_usec = 0;
		int64		min_usec;

		FD_ZERO(&input_mask);

		maxsock = -1;
		min_usec = INT64_MAX;
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			sock;

			if (st->sleeping)
			{
				int64		this_usec;	/* int64: st->until - now_usec can exceed int range */

				if (min_usec == INT64_MAX)
				{
					instr_time	now;

					INSTR_TIME_SET_CURRENT(now);
					now_usec = INSTR_TIME_GET_MICROSEC(now);
				}

				this_usec = st->until - now_usec;
				if (min_usec > this_usec)
					min_usec = this_usec;
			}
			else if (st->con == NULL)
			{
				continue;
			}
			else if (commands[st->state]->type == META_COMMAND)
			{
				min_usec = 0;	/* the connection is ready to run */
				break;
			}

			sock = PQsocket(st->con);
			if (sock < 0)
			{
				fprintf(stderr, "bad socket: %s\n", strerror(errno));
				goto done;
			}

			FD_SET(sock, &input_mask);

			if (maxsock < sock)
				maxsock = sock;
		}

		if (min_usec > 0 && maxsock != -1)
		{
			int			nsocks; /* return from select(2) */

			if (min_usec != INT64_MAX)
			{
				struct timeval timeout;

				timeout.tv_sec = min_usec / 1000000;
				timeout.tv_usec = min_usec % 1000000;
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout);
			}
			else
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
			if (nsocks < 0)
			{
				if (errno == EINTR)
					continue;
				/* must be something wrong */
				fprintf(stderr, "select failed: %s\n", strerror(errno));
				goto done;
			}
		}

		/* ok, backend returns reply */
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			prev_ecnt = st->ecnt;

			if (st->con && (FD_ISSET(PQsocket(st->con), &input_mask)
							|| commands[st->state]->type == META_COMMAND))
			{
				if (!doCustom(thread, st, &result->conn_time, logfile))
					remains--;	/* I've aborted */
			}

			if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
			{
				fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
				remains--;		/* I've aborted */
				PQfinish(st->con);
				st->con = NULL;
			}
		}
	}

done:
	INSTR_TIME_SET_CURRENT(start);
	disconnect_all(state, nstate);
	result->xacts = 0;
	for (i = 0; i < nstate; i++)
		result->xacts += state[i].cnt;
	INSTR_TIME_SET_CURRENT(end);
	INSTR_TIME_ACCUM_DIFF(result->conn_time, end, start);
	if (logfile)
		fclose(logfile);
	return result;
}


/*
 * Support for duration option: set timer_exceeded after so many seconds.
 */

#ifndef WIN32

static void
handle_sig_alarm(SIGNAL_ARGS)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	pqsignal(SIGALRM, handle_sig_alarm);
	alarm(seconds);
}

#ifndef ENABLE_THREAD_SAFETY

/*
 * implements pthread using fork.
 */

typedef struct fork_pthread
{
	pid_t		pid;
	int			pipes[2];
}	fork_pthread;

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	fork_pthread *th;
	void	   *ret;

	th = (fork_pthread *) xmalloc(sizeof(fork_pthread));
	if (pipe(th->pipes) < 0)
	{
		free(th);
		return errno;
	}

	th->pid = fork();
	if (th->pid == -1)			/* error */
	{
		free(th);
		return errno;
	}
	if (th->pid != 0)			/* in parent process */
	{
		close(th->pipes[1]);
		*thread = th;
		return 0;
	}

	/* in child process */
	close(th->pipes[0]);

	/* set alarm again because the child does not inherit timers */
	if (duration > 0)
		setalarm(duration);

	ret = start_routine(arg);
	write(th->pipes[1], ret, sizeof(TResult));
	close(th->pipes[1]);
	free(th);
	exit(0);
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	int			status;

	while (waitpid(th->pid, &status, 0) != th->pid)
	{
		if (errno != EINTR)
			return errno;
	}

	if (thread_return != NULL)
	{
		/* assume result is TResult */
		*thread_return = xmalloc(sizeof(TResult));
		if (read(th->pipes[0], *thread_return, sizeof(TResult)) != sizeof(TResult))
		{
			free(*thread_return);
			*thread_return = NULL;
		}
	}
	close(th->pipes[0]);

	free(th);
	return 0;
}
#endif
#else							/* WIN32 */

static VOID CALLBACK
win32_timer_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	HANDLE		queue;
	HANDLE		timer;

	/* This function will be called at most once, so we can cheat a bit. */
	queue = CreateTimerQueue();
	if (seconds > ((DWORD) -1) / 1000 ||
		!CreateTimerQueueTimer(&timer, queue,
							   win32_timer_callback, NULL, seconds * 1000, 0,
							   WT_EXECUTEINTIMERTHREAD | WT_EXECUTEONLYONCE))
	{
		fprintf(stderr, "Failed to set timer\n");
		exit(1);
	}
}

/* partial pthread implementation for Windows */

typedef struct win32_pthread
{
	HANDLE		handle;
	void	   *(*routine) (void *);
	void	   *arg;
	void	   *result;
} win32_pthread;

static unsigned __stdcall
win32_pthread_run(void *arg)
{
	win32_pthread *th = (win32_pthread *) arg;

	th->result = th->routine(th->arg);

	return 0;
}

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	int			save_errno;
	win32_pthread *th;

	th = (win32_pthread *) xmalloc(sizeof(win32_pthread));
	th->routine = start_routine;
	th->arg = arg;
	th->result = NULL;

	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
	if (th->handle == NULL)
	{
		save_errno = errno;
		free(th);
		return save_errno;
	}

	*thread = th;
	return 0;
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	if (th == NULL || th->handle == NULL)
		return errno = EINVAL;

	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
	{
		_dosmaperr(GetLastError());
		return errno;
	}

	if (thread_return)
		*thread_return = th->result;

	CloseHandle(th->handle);
	free(th);
	return 0;
}

#endif   /* WIN32 */

pgbench_wal_modified_and_lz_random_test.htm (text/html)

#11

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Amit Kapila (#10)

3 attachment(s)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:

On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:

On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
On 25.09.2012 18:27, Amit Kapila wrote:

If you feel it is a must to do the comparison, we can do it in the same
way as we identify for HOT?

Now I shall run the following tests and post the results here:
a. Attached patch in the mode where it takes advantage of the history
tuple.
b. Changing the logic for modified-column calculation to use memcmp().
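
Option (b) refers to deciding which columns changed by comparing the old and new values byte by byte. A minimal sketch of that idea in C follows, assuming each column is available as a pointer plus a length. The `ColumnValue` type and the `find_modified_columns` function are hypothetical names used for illustration only; they are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical column representation: raw bytes plus length. */
typedef struct
{
	const char *data;
	size_t		len;
} ColumnValue;

/*
 * Mark each column whose bytes differ between the old and new row.
 * Returns the number of modified columns.
 */
static int
find_modified_columns(const ColumnValue *oldcols,
					  const ColumnValue *newcols,
					  int ncols, bool *modified)
{
	int			i;
	int			nmodified = 0;

	assert(ncols >= 0);

	for (i = 0; i < ncols; i++)
	{
		/* A length change is a modification; otherwise compare bytes. */
		modified[i] = oldcols[i].len != newcols[i].len ||
			memcmp(oldcols[i].data, newcols[i].data, oldcols[i].len) != 0;
		if (modified[i])
			nmodified++;
	}
	return nmodified;
}
```

This sketch does not model NULL columns; a real implementation would have to compare the null bitmaps as well before looking at the data bytes.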

Attached documents contain data for following scenarios for both 'a' (LZ
compression patch) and 'b' (modified wal patch) patches:

1. Using fixed string (last few bytes are random) to update the column
values.
Total record length = 1800
Updated columns length = 250
2. Using random string to update the column values
Total record length = 1800
Updated columns length = 250

Observations -
1. With both patches the performance increase is very good.
2. The performance increase is almost the same with both patches,
slightly higher for the LZ compression patch.
3. TPS varies with the LZ patch, but on average it is equivalent to the
other patch.

Other Performance tests I am planning to conduct
1. Using bigger random string to update the column values
Total record length = 1800
Updated columns length = 250
2. Using fixed string (last few bytes are random) to update the column
values.
Total record length = 1800
Updated columns length = 50, 100, 500, 750, 1000, 1500, 1800

1. Please find the results (pgbench_test.htm) for point 2, where one fixed
column is updated (last few bytes are random) and the second column is
updated with a 32-byte random string. The tests for 50 and 100 are still
running; the others are attached with this mail.
2. Attached is the pgbench test code for modifications of 25 and 250 bytes,
with a total record length of 1800.
For the other modification-size tests, the schema is changed
accordingly.
3. Added random string generation for updating some column data, from the
250-byte modification test onwards.
CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
    select upper(
        substring(
            (
                SELECT string_agg(md5(random()::TEXT), '')
                FROM generate_series(1, CEIL($1 / 32.)::integer)
            ),
            $1)
    );
$$;
4. Observations
a. When the updated value is small, the performance is almost the same
for both patches.
b. When the updated value is larger, the performance with the LZ patch
is better.
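
For client-side experiments, a comparable random payload could be produced without calling the server. The sketch below is a hypothetical C helper (`random_hex_text` is not part of pgbench or of the attached test code) that fills a buffer with uppercase hexadecimal characters, the same character set that `upper(md5(...))` yields. It relies on `rand()`, so it is meant only for generating test data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Fill buf with len random characters from [0-9A-F] and NUL-terminate it.
 * buf must have room for len + 1 bytes.  Hypothetical helper.
 */
static void
random_hex_text(char *buf, size_t len)
{
	static const char digits[] = "0123456789ABCDEF";
	size_t		i;

	assert(buf != NULL);

	for (i = 0; i < len; i++)
		buf[i] = digits[rand() % 16];
	buf[len] = '\0';
}
```

Calling `srand()` with a fixed seed before the first use makes the generated payloads repeatable across runs, which can help when comparing the two patches on identical input.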

3. Recovery performance test as suggested by Noah

Still not started.

4. Complete testing for LZ compression patch using testcases defined for
original patch

a. During testing of the LZ patch, a few issues were found when the
updated record contains NULLs. Working on a fix.

Any comments/suggestions regarding performance/functionality test?

With Regards,
Amit Kapila.

Attachments:

pgbench_test.htm (text/html)

pgbench_25.c (application/octet-stream)

/*
 * pgbench.c
 *
 * A simple benchmark program for PostgreSQL
 * Originally written by Tatsuo Ishii and enhanced by many contributors.
 *
 * contrib/pgbench/pgbench.c
 * Copyright (c) 2000-2012, PostgreSQL Global Development Group
 * ALL RIGHTS RESERVED;
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose, without fee, and without a written agreement
 * is hereby granted, provided that the above copyright notice and this
 * paragraph and the following two paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
 * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIMS ANY WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
 * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
 * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 *
 */

#ifdef WIN32
#define FD_SETSIZE 1024			/* set before winsock2.h is included */
#endif   /* WIN32 */

#include "postgres_fe.h"

#include "getopt_long.h"
#include "libpq-fe.h"
#include "libpq/pqsignal.h"
#include "portability/instr_time.h"

#include <ctype.h>

#ifndef WIN32
#include <sys/time.h>
#include <unistd.h>
#endif   /* ! WIN32 */

#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

#ifdef HAVE_SYS_RESOURCE_H
#include <sys/resource.h>		/* for getrlimit */
#endif

#ifndef INT64_MAX
#define INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
#endif

/*
 * Multi-platform pthread implementations
 */

#ifdef WIN32
/* Use native win32 threads on Windows */
typedef struct win32_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#elif defined(ENABLE_THREAD_SAFETY)
/* Use platform-dependent pthread capability */
#include <pthread.h>
#else
/* Use emulation with fork. Rename pthread identifiers to avoid conflicts */

#include <sys/wait.h>

#define pthread_t				pg_pthread_t
#define pthread_attr_t			pg_pthread_attr_t
#define pthread_create			pg_pthread_create
#define pthread_join			pg_pthread_join

typedef struct fork_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#endif

extern char *optarg;
extern int	optind;


/********************************************************************
 * some configurable parameters */

/* max number of clients allowed */
#ifdef FD_SETSIZE
#define MAXCLIENTS	(FD_SETSIZE - 10)
#else
#define MAXCLIENTS	1024
#endif

#define DEFAULT_NXACTS	10		/* default nxacts */

int			nxacts = 0;			/* number of transactions per client */
int			duration = 0;		/* duration in seconds */

/*
 * scaling factor. for example, scale = 10 will make 1000000 tuples in
 * pgbench_accounts table.
 */
int			scale = 1;

/*
 * fillfactor. for example, fillfactor = 90 will use only 90 percent
 * space during inserts and leave 10 percent free.
 */
int			fillfactor = 100;

/*
 * create foreign key constraints on the tables?
 */
int			foreign_keys = 0;

/*
 * use unlogged tables?
 */
int			unlogged_tables = 0;

/*
 * tablespace selection
 */
char	   *tablespace = NULL;
char	   *index_tablespace = NULL;

/*
 * end of configurable parameters
 *********************************************************************/

#define nbranches	1			/* Makes little sense to change this.  Change
								 * -s instead */
#define ntellers	10
#define naccounts	100000

bool		use_log;			/* log transaction latencies to a file */
bool		is_connect;			/* establish connection for each transaction */
bool		is_latencies;		/* report per-command latencies */
int			main_pid;			/* main process id used in log filename */

char	   *pghost = "";
char	   *pgport = "";
char	   *login = NULL;
char	   *dbName;
const char *progname;

volatile bool timer_exceeded = false;	/* flag from signal handler */

/* variable definitions */
typedef struct
{
	char	   *name;			/* variable name */
	char	   *value;			/* its value */
} Variable;

#define MAX_FILES		128		/* max number of SQL script files allowed */
#define SHELL_COMMAND_SIZE	256 /* maximum size allowed for shell command */

/*
 * structures used in custom query mode
 */

typedef struct
{
	PGconn	   *con;			/* connection handle to DB */
	int			id;				/* client No. */
	int			state;			/* state No. */
	int			cnt;			/* xacts count */
	int			ecnt;			/* error count */
	int			listen;			/* 0 indicates that an async query has been
								 * sent */
	int			sleeping;		/* 1 indicates that the client is napping */
	int64		until;			/* napping until (usec) */
	Variable   *variables;		/* array of variable definitions */
	int			nvariables;
	instr_time	txn_begin;		/* used for measuring transaction latencies */
	instr_time	stmt_begin;		/* used for measuring statement latencies */
	int			use_file;		/* index in sql_files for this client */
	bool		prepared[MAX_FILES];
} CState;

/*
 * Thread state and result
 */
typedef struct
{
	int			tid;			/* thread id */
	pthread_t	thread;			/* thread handle */
	CState	   *state;			/* array of CState */
	int			nstate;			/* length of state[] */
	instr_time	start_time;		/* thread start time */
	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
	int		   *exec_count;		/* number of cmd executions (per Command) */
	unsigned short random_state[3];		/* separate randomness for each thread */
} TState;

#define INVALID_THREAD		((pthread_t) 0)

typedef struct
{
	instr_time	conn_time;
	int			xacts;
} TResult;

/*
 * queries read from files
 */
#define SQL_COMMAND		1
#define META_COMMAND	2
#define MAX_ARGS		10

typedef enum QueryMode
{
	QUERY_SIMPLE,				/* simple query */
	QUERY_EXTENDED,				/* extended query */
	QUERY_PREPARED,				/* extended query with prepared statements */
	NUM_QUERYMODE
} QueryMode;

static QueryMode querymode = QUERY_SIMPLE;
static const char *QUERYMODE[] = {"simple", "extended", "prepared"};

typedef struct
{
	char	   *line;			/* full text of command line */
	int			command_num;	/* unique index of this Command struct */
	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
	int			argc;			/* number of command words */
	char	   *argv[MAX_ARGS]; /* command word list */
} Command;

static Command **sql_files[MAX_FILES];	/* SQL script files */
static int	num_files;			/* number of script files */
static int	num_commands = 0;	/* total number of Command structs */
static int	debug = 0;			/* debug flag */

/* default scenario */
static char *tpc_b = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta,"
	"filler = \'abcdefghijklmno :delta\'"
	" WHERE aid = :aid;\n"
	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta,"
	"filler = \'abcdefghijklmno :delta\'"
	" WHERE tid = :tid;\n"
	"UPDATE pgbench_branches SET bbalance = bbalance + :delta,"
	"filler = \'abcdefghijklmno :delta\'"
	" WHERE bid = :bid;\n"
	"END;\n"
};

/* -N case */
static char *simple_update = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
	"END;\n"
};

/* -S case */
static char *select_only = {
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
};

/* Function prototypes */
static void setalarm(int seconds);
static void *threadRun(void *arg);


/*
 * routines to check mem allocations and fail noisily.
 */
static void *
xmalloc(size_t size)
{
	void	   *result;

	result = malloc(size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static void *
xrealloc(void *ptr, size_t size)
{
	void	   *result;

	result = realloc(ptr, size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static char *
xstrdup(const char *s)
{
	char	   *result;

	result = strdup(s);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}


static void
usage(void)
{
	printf("%s is a benchmarking tool for PostgreSQL.\n\n"
		   "Usage:\n"
		   "  %s [OPTION]... [DBNAME]\n"
		   "\nInitialization options:\n"
		   "  -i           invokes initialization mode\n"
		   "  -n           do not run VACUUM after initialization\n"
		   "  -F NUM       fill factor\n"
		   "  -s NUM       scaling factor\n"
		   "  --foreign-keys\n"
		   "               create foreign key constraints between tables\n"
		   "  --index-tablespace=TABLESPACE\n"
		   "               create indexes in the specified tablespace\n"
		   "  --tablespace=TABLESPACE\n"
		   "               create tables in the specified tablespace\n"
		   "  --unlogged-tables\n"
		   "               create tables as unlogged tables\n"
		   "\nBenchmarking options:\n"
		"  -c NUM       number of concurrent database clients (default: 1)\n"
		   "  -C           establish new connection for each transaction\n"
		   "  -D VARNAME=VALUE\n"
		   "               define variable for use by custom script\n"
		   "  -f FILENAME  read transaction script from FILENAME\n"
		   "  -j NUM       number of threads (default: 1)\n"
		   "  -l           write transaction times to log file\n"
		   "  -M simple|extended|prepared\n"
		   "               protocol for submitting queries to server (default: simple)\n"
		   "  -n           do not run VACUUM before tests\n"
		   "  -N           do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
		   "  -r           report average latency per command\n"
		   "  -s NUM       report this scale factor in output\n"
		   "  -S           perform SELECT-only transactions\n"
	 "  -t NUM       number of transactions each client runs (default: 10)\n"
		   "  -T NUM       duration of benchmark test in seconds\n"
		   "  -v           vacuum all four standard tables before tests\n"
		   "\nCommon options:\n"
		   "  -d             print debugging output\n"
		   "  -h HOSTNAME    database server host or socket directory\n"
		   "  -p PORT        database server port number\n"
		   "  -U USERNAME    connect as specified database user\n"
		   "  -V, --version  output version information, then exit\n"
		   "  -?, --help     show this help, then exit\n"
		   "\n"
		   "Report bugs to <pgsql-bugs@postgresql.org>.\n",
		   progname, progname);
}

/* random number generator: uniform distribution from min to max inclusive */
static int
getrand(TState *thread, int min, int max)
{
	/*
	 * Odd coding is so that min and max have approximately the same chance of
	 * being selected as do numbers between them.
	 *
	 * pg_erand48() is thread-safe and concurrent, which is why we use it
	 * rather than random(), which in glibc is non-reentrant, and therefore
	 * protected by a mutex, and therefore a bottleneck on machines with many
	 * CPUs.
	 */
	return min + (int) ((max - min + 1) * pg_erand48(thread->random_state));
}

/* call PQexec() and exit() on failure */
static void
executeStatement(PGconn *con, const char *sql)
{
	PGresult   *res;

	res = PQexec(con, sql);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);
}

/* set up a connection to the backend */
static PGconn *
doConnect(void)
{
	PGconn	   *conn;
	static char *password = NULL;
	bool		new_pass;

	/*
	 * Start the connection.  Loop until we have a password if requested by
	 * backend.
	 */
	do
	{
#define PARAMS_ARRAY_SIZE	7

		const char *keywords[PARAMS_ARRAY_SIZE];
		const char *values[PARAMS_ARRAY_SIZE];

		keywords[0] = "host";
		values[0] = pghost;
		keywords[1] = "port";
		values[1] = pgport;
		keywords[2] = "user";
		values[2] = login;
		keywords[3] = "password";
		values[3] = password;
		keywords[4] = "dbname";
		values[4] = dbName;
		keywords[5] = "fallback_application_name";
		values[5] = progname;
		keywords[6] = NULL;
		values[6] = NULL;

		new_pass = false;

		conn = PQconnectdbParams(keywords, values, true);

		if (!conn)
		{
			fprintf(stderr, "Connection to database \"%s\" failed\n",
					dbName);
			return NULL;
		}

		if (PQstatus(conn) == CONNECTION_BAD &&
			PQconnectionNeedsPassword(conn) &&
			password == NULL)
		{
			PQfinish(conn);
			password = simple_prompt("Password: ", 100, false);
			new_pass = true;
		}
	} while (new_pass);

	/* check to see that the backend connection was successfully made */
	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database \"%s\" failed:\n%s",
				dbName, PQerrorMessage(conn));
		PQfinish(conn);
		return NULL;
	}

	return conn;
}

/* throw away response from backend */
static void
discard_response(CState *state)
{
	PGresult   *res;

	do
	{
		res = PQgetResult(state->con);
		if (res)
			PQclear(res);
	} while (res);
}

static int
compareVariables(const void *v1, const void *v2)
{
	return strcmp(((const Variable *) v1)->name,
				  ((const Variable *) v2)->name);
}

static char *
getVariable(CState *st, char *name)
{
	Variable	key,
			   *var;

	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables <= 0)
		return NULL;

	key.name = name;
	var = (Variable *) bsearch((void *) &key,
							   (void *) st->variables,
							   st->nvariables,
							   sizeof(Variable),
							   compareVariables);
	if (var != NULL)
		return var->value;
	else
		return NULL;
}

/* check whether the name consists of alphabets, numerals and underscores. */
static bool
isLegalVariableName(const char *name)
{
	int			i;

	for (i = 0; name[i] != '\0'; i++)
	{
		if (!isalnum((unsigned char) name[i]) && name[i] != '_')
			return false;
	}

	return true;
}

static int
putVariable(CState *st, const char *context, char *name, char *value)
{
	Variable	key,
			   *var;

	key.name = name;
	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables > 0)
		var = (Variable *) bsearch((void *) &key,
								   (void *) st->variables,
								   st->nvariables,
								   sizeof(Variable),
								   compareVariables);
	else
		var = NULL;

	if (var == NULL)
	{
		Variable   *newvars;

		/*
		 * Check for the name only when declaring a new variable to avoid
		 * overhead.
		 */
		if (!isLegalVariableName(name))
		{
			fprintf(stderr, "%s: invalid variable name '%s'\n", context, name);
			return false;
		}

		if (st->variables)
			newvars = (Variable *) xrealloc(st->variables,
									(st->nvariables + 1) * sizeof(Variable));
		else
			newvars = (Variable *) xmalloc(sizeof(Variable));

		st->variables = newvars;

		var = &newvars[st->nvariables];

		var->name = xstrdup(name);
		var->value = xstrdup(value);

		st->nvariables++;

		qsort((void *) st->variables, st->nvariables, sizeof(Variable),
			  compareVariables);
	}
	else
	{
		char	   *val;

		/* dup then free, in case value is pointing at this variable */
		val = xstrdup(value);

		free(var->value);
		var->value = val;
	}

	return true;
}

static char *
parseVariable(const char *sql, int *eaten)
{
	int			i = 0;
	char	   *name;

	do
	{
		i++;
	} while (isalnum((unsigned char) sql[i]) || sql[i] == '_');
	if (i == 1)
		return NULL;

	name = xmalloc(i);
	memcpy(name, &sql[1], i - 1);
	name[i - 1] = '\0';

	*eaten = i;
	return name;
}

static char *
replaceVariable(char **sql, char *param, int len, char *value)
{
	int			valueln = strlen(value);

	if (valueln > len)
	{
		size_t		offset = param - *sql;

		*sql = xrealloc(*sql, strlen(*sql) - len + valueln + 1);
		param = *sql + offset;
	}

	if (valueln != len)
		memmove(param + valueln, param + len, strlen(param + len) + 1);
	strncpy(param, value, valueln);

	return param + valueln;
}

static char *
assignVariables(CState *st, char *sql)
{
	char	   *p,
			   *name,
			   *val;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		val = getVariable(st, name);
		free(name);
		if (val == NULL)
		{
			p++;
			continue;
		}

		p = replaceVariable(&sql, p, eaten, val);
	}

	return sql;
}

static void
getQueryParams(CState *st, const Command *command, const char **params)
{
	int			i;

	for (i = 0; i < command->argc - 1; i++)
		params[i] = getVariable(st, command->argv[i + 1]);
}

/*
 * Run a shell command. The result is assigned to the variable if not NULL.
 * Return true if succeeded, or false on error.
 */
static bool
runShellCommand(CState *st, char *variable, char **argv, int argc)
{
	char		command[SHELL_COMMAND_SIZE];
	int			i,
				len = 0;
	FILE	   *fp;
	char		res[64];
	char	   *endptr;
	int			retval;

	/*----------
	 * Join arguments with whitespace separators. Arguments starting with
	 * exactly one colon are treated as variables:
	 *	name - append a string "name"
	 *	:var - append a variable named 'var'
	 *	::name - append a string ":name"
	 *----------
	 */
	for (i = 0; i < argc; i++)
	{
		char	   *arg;
		int			arglen;

		if (argv[i][0] != ':')
		{
			arg = argv[i];		/* a string literal */
		}
		else if (argv[i][1] == ':')
		{
			arg = argv[i] + 1;	/* a string literal starting with colons */
		}
		else if ((arg = getVariable(st, argv[i] + 1)) == NULL)
		{
			fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[i]);
			return false;
		}

		arglen = strlen(arg);
		if (len + arglen + (i > 0 ? 1 : 0) >= SHELL_COMMAND_SIZE - 1)
		{
			fprintf(stderr, "%s: shell command is too long\n", argv[0]);
			return false;
		}

		if (i > 0)
			command[len++] = ' ';
		memcpy(command + len, arg, arglen);
		len += arglen;
	}

	command[len] = '\0';

	/* Fast path for non-assignment case */
	if (variable == NULL)
	{
		if (system(command))
		{
			if (!timer_exceeded)
				fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
			return false;
		}
		return true;
	}

	/* Execute the command with pipe and read the standard output. */
	if ((fp = popen(command, "r")) == NULL)
	{
		fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
		return false;
	}
	if (fgets(res, sizeof(res), fp) == NULL)
	{
		if (!timer_exceeded)
			fprintf(stderr, "%s: cannot read the result\n", argv[0]);
		(void) pclose(fp);		/* don't leak the pipe on the error path */
		return false;
	}
	if (pclose(fp) < 0)
	{
		fprintf(stderr, "%s: cannot close shell command\n", argv[0]);
		return false;
	}

	/* Check whether the result is an integer and assign it to the variable */
	retval = (int) strtol(res, &endptr, 10);
	while (*endptr != '\0' && isspace((unsigned char) *endptr))
		endptr++;
	if (*res == '\0' || *endptr != '\0')
	{
		fprintf(stderr, "%s: must return an integer ('%s' returned)\n", argv[0], res);
		return false;
	}
	snprintf(res, sizeof(res), "%d", retval);
	if (!putVariable(st, "setshell", variable, res))
		return false;

#ifdef DEBUG
	printf("shell parameter name: %s, value: %s\n", argv[1], res);
#endif
	return true;
}

#define MAX_PREPARE_NAME		32
static void
preparedStatementName(char *buffer, int file, int state)
{
	sprintf(buffer, "P%d_%d", file, state);
}

static bool
clientDone(CState *st, bool ok)
{
	(void) ok;					/* unused */

	if (st->con != NULL)
	{
		PQfinish(st->con);
		st->con = NULL;
	}
	return false;				/* always false */
}

/* return false iff client should be disconnected */
static bool
doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile)
{
	PGresult   *res;
	Command   **commands;

top:
	commands = sql_files[st->use_file];

	if (st->sleeping)
	{							/* are we sleeping? */
		instr_time	now;

		INSTR_TIME_SET_CURRENT(now);
		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
		else
			return true;		/* Still sleeping, nothing to do here */
	}

	if (st->listen)
	{							/* are we receiver? */
		if (commands[st->state]->type == SQL_COMMAND)
		{
			if (debug)
				fprintf(stderr, "client %d receiving\n", st->id);
			if (!PQconsumeInput(st->con))
			{					/* there's something wrong */
				fprintf(stderr, "Client %d aborted in state %d. Probably the backend died while processing.\n", st->id, st->state);
				return clientDone(st, false);
			}
			if (PQisBusy(st->con))
				return true;	/* don't have the whole result yet */
		}

		/*
		 * command finished: accumulate per-command execution times in
		 * thread-local data structure, if per-command latencies are requested
		 */
		if (is_latencies)
		{
			instr_time	now;
			int			cnum = commands[st->state]->command_num;

			INSTR_TIME_SET_CURRENT(now);
			INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
								  now, st->stmt_begin);
			thread->exec_count[cnum]++;
		}

		/*
		 * if transaction finished, record the time it took in the log
		 */
		if (logfile && commands[st->state + 1] == NULL)
		{
			instr_time	now;
			instr_time	diff;
			double		usec;

			INSTR_TIME_SET_CURRENT(now);
			diff = now;
			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
			usec = (double) INSTR_TIME_GET_MICROSEC(diff);

#ifndef WIN32
			/* This is more than we really ought to know about instr_time */
			fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
					st->id, st->cnt, usec, st->use_file,
					(long) now.tv_sec, (long) now.tv_usec);
#else
			/* On Windows, instr_time doesn't provide a timestamp anyway */
			fprintf(logfile, "%d %d %.0f %d 0 0\n",
					st->id, st->cnt, usec, st->use_file);
#endif
		}

		if (commands[st->state]->type == SQL_COMMAND)
		{
			/*
			 * Read and discard the query result; note this is not included in
			 * the statement latency numbers.
			 */
			res = PQgetResult(st->con);
			switch (PQresultStatus(res))
			{
				case PGRES_COMMAND_OK:
				case PGRES_TUPLES_OK:
					break;		/* OK */
				default:
					fprintf(stderr, "Client %d aborted in state %d: %s",
							st->id, st->state, PQerrorMessage(st->con));
					PQclear(res);
					return clientDone(st, false);
			}
			PQclear(res);
			discard_response(st);
		}

		if (commands[st->state + 1] == NULL)
		{
			if (is_connect)
			{
				PQfinish(st->con);
				st->con = NULL;
			}

			++st->cnt;
			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
				return clientDone(st, true);	/* exit success */
		}

		/* increment state counter */
		st->state++;
		if (commands[st->state] == NULL)
		{
			st->state = 0;
			st->use_file = getrand(thread, 0, num_files - 1);
			commands = sql_files[st->use_file];
		}
	}

	if (st->con == NULL)
	{
		instr_time	start,
					end;

		INSTR_TIME_SET_CURRENT(start);
		if ((st->con = doConnect()) == NULL)
		{
			fprintf(stderr, "Client %d aborted in establishing connection.\n", st->id);
			return clientDone(st, false);
		}
		INSTR_TIME_SET_CURRENT(end);
		INSTR_TIME_ACCUM_DIFF(*conn_time, end, start);
	}

	/* Record transaction start time if logging is enabled */
	if (logfile && st->state == 0)
		INSTR_TIME_SET_CURRENT(st->txn_begin);

	/* Record statement start time if per-command latencies are requested */
	if (is_latencies)
		INSTR_TIME_SET_CURRENT(st->stmt_begin);

	if (commands[st->state]->type == SQL_COMMAND)
	{
		const Command *command = commands[st->state];
		int			r;

		if (querymode == QUERY_SIMPLE)
		{
			char	   *sql;

			sql = xstrdup(command->argv[0]);
			sql = assignVariables(st, sql);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQuery(st->con, sql);
			free(sql);
		}
		else if (querymode == QUERY_EXTENDED)
		{
			const char *sql = command->argv[0];
			const char *params[MAX_ARGS];

			getQueryParams(st, command, params);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQueryParams(st->con, sql, command->argc - 1,
								  NULL, params, NULL, NULL, 0);
		}
		else if (querymode == QUERY_PREPARED)
		{
			char		name[MAX_PREPARE_NAME];
			const char *params[MAX_ARGS];

			if (!st->prepared[st->use_file])
			{
				int			j;

				for (j = 0; commands[j] != NULL; j++)
				{
					PGresult   *res;
					char		name[MAX_PREPARE_NAME];

					if (commands[j]->type != SQL_COMMAND)
						continue;
					preparedStatementName(name, st->use_file, j);
					res = PQprepare(st->con, name,
						  commands[j]->argv[0], commands[j]->argc - 1, NULL);
					if (PQresultStatus(res) != PGRES_COMMAND_OK)
						fprintf(stderr, "%s", PQerrorMessage(st->con));
					PQclear(res);
				}
				st->prepared[st->use_file] = true;
			}

			getQueryParams(st, command, params);
			preparedStatementName(name, st->use_file, st->state);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, name);
			r = PQsendQueryPrepared(st->con, name, command->argc - 1,
									params, NULL, NULL, 0);
		}
		else	/* unknown sql mode */
			r = 0;

		if (r == 0)
		{
			if (debug)
				fprintf(stderr, "client %d cannot send %s\n", st->id, command->argv[0]);
			st->ecnt++;
		}
		else
			st->listen = 1;		/* flag that we should listen for the result */
	}
	else if (commands[st->state]->type == META_COMMAND)
	{
		int			argc = commands[st->state]->argc,
					i;
		char	  **argv = commands[st->state]->argv;

		if (debug)
		{
			fprintf(stderr, "client %d executing \\%s", st->id, argv[0]);
			for (i = 1; i < argc; i++)
				fprintf(stderr, " %s", argv[i]);
			fprintf(stderr, "\n");
		}

		if (pg_strcasecmp(argv[0], "setrandom") == 0)
		{
			char	   *var;
			int			min,
						max;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				min = atoi(var);
			}
			else
				min = atoi(argv[2]);

#ifdef NOT_USED
			if (min < 0)
			{
				fprintf(stderr, "%s: invalid minimum number %d\n", argv[0], min);
				st->ecnt++;
				return true;
			}
#endif

			if (*argv[3] == ':')
			{
				if ((var = getVariable(st, argv[3] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
				max = atoi(var);
			}
			else
				max = atoi(argv[3]);

			if (max < min)
			{
				fprintf(stderr, "%s: maximum is less than minimum\n", argv[0]);
				st->ecnt++;
				return true;
			}

			/*
			 * getrand() needs to be able to subtract min from max and add
			 * one to the result without overflowing.  Since we know max >= min,
			 * we can detect overflow just by checking for a negative result.
			 * But we must check both that the subtraction doesn't overflow,
			 * and that adding one to the result doesn't overflow either.
			 */
			if (max - min < 0 || (max - min) + 1 < 0)
			{
				fprintf(stderr, "%s: range too large\n", argv[0]);
				st->ecnt++;
				return true;
			}

#ifdef DEBUG
			printf("min: %d max: %d random: %d\n", min, max, getrand(thread, min, max));
#endif
			snprintf(res, sizeof(res), "%d", getrand(thread, min, max));

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "set") == 0)
		{
			char	   *var;
			int			ope1,
						ope2;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				ope1 = atoi(var);
			}
			else
				ope1 = atoi(argv[2]);

			if (argc < 5)
				snprintf(res, sizeof(res), "%d", ope1);
			else
			{
				if (*argv[4] == ':')
				{
					if ((var = getVariable(st, argv[4] + 1)) == NULL)
					{
						fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[4]);
						st->ecnt++;
						return true;
					}
					ope2 = atoi(var);
				}
				else
					ope2 = atoi(argv[4]);

				if (strcmp(argv[3], "+") == 0)
					snprintf(res, sizeof(res), "%d", ope1 + ope2);
				else if (strcmp(argv[3], "-") == 0)
					snprintf(res, sizeof(res), "%d", ope1 - ope2);
				else if (strcmp(argv[3], "*") == 0)
					snprintf(res, sizeof(res), "%d", ope1 * ope2);
				else if (strcmp(argv[3], "/") == 0)
				{
					if (ope2 == 0)
					{
						fprintf(stderr, "%s: division by zero\n", argv[0]);
						st->ecnt++;
						return true;
					}
					snprintf(res, sizeof(res), "%d", ope1 / ope2);
				}
				else
				{
					fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
			}

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "sleep") == 0)
		{
			char	   *var;
			int			usec;
			instr_time	now;

			if (*argv[1] == ':')
			{
				if ((var = getVariable(st, argv[1] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[1]);
					st->ecnt++;
					return true;
				}
				usec = atoi(var);
			}
			else
				usec = atoi(argv[1]);

			if (argc > 2)
			{
				if (pg_strcasecmp(argv[2], "ms") == 0)
					usec *= 1000;
				else if (pg_strcasecmp(argv[2], "s") == 0)
					usec *= 1000000;
			}
			else
				usec *= 1000000;

			INSTR_TIME_SET_CURRENT(now);
			st->until = INSTR_TIME_GET_MICROSEC(now) + usec;
			st->sleeping = 1;

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "setshell") == 0)
		{
			bool		ret = runShellCommand(st, argv[1], argv + 2, argc - 2);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "shell") == 0)
		{
			bool		ret = runShellCommand(st, NULL, argv + 1, argc - 1);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		goto top;
	}

	return true;
}

/* discard connections */
static void
disconnect_all(CState *state, int length)
{
	int			i;

	for (i = 0; i < length; i++)
	{
		if (state[i].con)
		{
			PQfinish(state[i].con);
			state[i].con = NULL;
		}
	}
}

/* create tables and set up data */
static void
init(bool is_no_vacuum)
{
	/*
	 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
	 * fields in these table declarations were intended to comply with that.
	 * But because they default to NULLs, they don't actually take any space.
	 * We could fix that by giving them non-null default values. However, that
	 * would completely break comparability of pgbench results with prior
	 * versions.  Since pgbench has never pretended to be fully TPC-B
	 * compliant anyway, we stick with the historical behavior.
	 */
	struct ddlinfo
	{
		char	   *table;
		char	   *cols;
		int			declare_fillfactor;
	};
	struct ddlinfo DDLs[] = {
		{
			"pgbench_history",
			"tid int,bid int,aid int,delta int,mtime timestamp,filler char(22)",
			0
		},
		{
			"pgbench_tellers",
			"tid int not null,bid int,tbalance int,filler char(22),"
			"tbalance1 int, filler1 varchar(1),tbalance2 int,filler2 char(1770)",
			1
		},
		{
			"pgbench_accounts",
			"aid int not null,bid int,abalance int,filler char(22),"
			"abalance1 int,filler1 varchar(1),abalance2 int,filler2 char(1770)",
			1
		},
		{
			"pgbench_branches",
			"bid int not null,bbalance int,filler char(22),bbalance1 int,"
			"filler1 varchar(1), bbalance2 int, filler2 char(1770)",
			1
		}
	};
	static char *DDLAFTERs[] = {
		"alter table pgbench_branches add primary key (bid)",
		"alter table pgbench_tellers add primary key (tid)",
		"alter table pgbench_accounts add primary key (aid)"
	};
	static char *DDLKEYs[] = {
		"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
		"alter table pgbench_accounts add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (tid) references pgbench_tellers",
		"alter table pgbench_history add foreign key (aid) references pgbench_accounts"
	};

	PGconn	   *con;
	PGresult   *res;
	char		sql[256];
	int			i;

	if ((con = doConnect()) == NULL)
		exit(1);

	for (i = 0; i < lengthof(DDLs); i++)
	{
		char		opts[256];
		char		buffer[256];
		struct ddlinfo *ddl = &DDLs[i];

		/* Remove old table, if it exists. */
		snprintf(buffer, 256, "drop table if exists %s", ddl->table);
		executeStatement(con, buffer);

		/* Construct new create table statement. */
		opts[0] = '\0';
		if (ddl->declare_fillfactor)
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " with (fillfactor=%d)", fillfactor);
		if (tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, tablespace,
												   strlen(tablespace));
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}
		snprintf(buffer, 256, "create%s table %s(%s)%s",
				 unlogged_tables ? " unlogged" : "",
				 ddl->table, ddl->cols, opts);

		executeStatement(con, buffer);
	}

	executeStatement(con, "begin");

	for (i = 0; i < nbranches * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_branches values(%d,0,0,0,0,0,0)", i + 1);
		executeStatement(con, sql);
	}

	for (i = 0; i < ntellers * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_tellers values (%d,%d,0,0,0,0,0,0)",
				 i + 1, i / ntellers + 1);
		executeStatement(con, sql);
	}

	executeStatement(con, "commit");

	/*
	 * fill the pgbench_accounts table with some data
	 */
	fprintf(stderr, "creating tables...\n");

	executeStatement(con, "begin");
	executeStatement(con, "truncate pgbench_accounts");

	res = PQexec(con, "copy pgbench_accounts from stdin");
	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);

	for (i = 0; i < naccounts * scale; i++)
	{
		int			j = i + 1;

		snprintf(sql, 256, "%d\t%d\t%d\t \t%d\t \t%d\t \n", j, i / naccounts + 1, 0, 0, 0);
		if (PQputline(con, sql))
		{
			fprintf(stderr, "PQputline failed\n");
			exit(1);
		}

		if (j % 100000 == 0)
			fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
					j, naccounts * scale,
					j * 100 / (naccounts * scale));
	}
	if (PQputline(con, "\\.\n"))
	{
		fprintf(stderr, "very last PQputline failed\n");
		exit(1);
	}
	if (PQendcopy(con))
	{
		fprintf(stderr, "PQendcopy failed\n");
		exit(1);
	}
	executeStatement(con, "commit");

	/* vacuum */
	if (!is_no_vacuum)
	{
		fprintf(stderr, "vacuum...\n");
		executeStatement(con, "vacuum analyze pgbench_branches");
		executeStatement(con, "vacuum analyze pgbench_tellers");
		executeStatement(con, "vacuum analyze pgbench_accounts");
		executeStatement(con, "vacuum analyze pgbench_history");
	}

	/*
	 * create indexes
	 */
	fprintf(stderr, "set primary keys...\n");
	for (i = 0; i < lengthof(DDLAFTERs); i++)
	{
		char		buffer[256];

		strncpy(buffer, DDLAFTERs[i], 256);

		if (index_tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, index_tablespace,
												   strlen(index_tablespace));
			snprintf(buffer + strlen(buffer), 256 - strlen(buffer),
					 " using index tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}

		executeStatement(con, buffer);
	}

	/*
	 * create foreign keys
	 */
	if (foreign_keys)
	{
		fprintf(stderr, "set foreign keys...\n");
		for (i = 0; i < lengthof(DDLKEYs); i++)
		{
			executeStatement(con, DDLKEYs[i]);
		}
	}


	fprintf(stderr, "done.\n");
	PQfinish(con);
}

/*
 * Parse the raw SQL and replace each :param with $n.
 */
static bool
parseQuery(Command *cmd, const char *raw_sql)
{
	char	   *sql,
			   *p;

	sql = xstrdup(raw_sql);
	cmd->argc = 1;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		char		var[12];
		char	   *name;
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		if (cmd->argc >= MAX_ARGS)
		{
			fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n", MAX_ARGS - 1, raw_sql);
			return false;
		}

		sprintf(var, "$%d", cmd->argc);
		p = replaceVariable(&sql, p, eaten, var);

		cmd->argv[cmd->argc] = name;
		cmd->argc++;
	}

	cmd->argv[0] = sql;
	return true;
}

/* Parse a command; return a Command struct, or NULL if it's a comment */
static Command *
process_commands(char *buf)
{
	const char	delim[] = " \f\n\r\t\v";

	Command    *my_commands;
	int			j;
	char	   *p,
			   *tok;

	/* Make the string buf end at the next newline */
	if ((p = strchr(buf, '\n')) != NULL)
		*p = '\0';

	/* Skip leading whitespace */
	p = buf;
	while (isspace((unsigned char) *p))
		p++;

	/* If the line is empty or actually a comment, we're done */
	if (*p == '\0' || strncmp(p, "--", 2) == 0)
		return NULL;

	/* Allocate and initialize Command structure */
	my_commands = (Command *) xmalloc(sizeof(Command));
	my_commands->line = xstrdup(buf);
	my_commands->command_num = num_commands++;
	my_commands->type = 0;		/* until set */
	my_commands->argc = 0;

	if (*p == '\\')
	{
		my_commands->type = META_COMMAND;

		j = 0;
		tok = strtok(++p, delim);

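		while (tok != NULL)
		{
			my_commands->argv[j++] = xstrdup(tok);
			my_commands->argc++;
			tok = strtok(NULL, delim);
		}
		/* A lone backslash yields no tokens, so argv[0] would be unset */
		if (my_commands->argc == 0)
		{
			fprintf(stderr, "missing command\n");
			exit(1);
		}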

		if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
		{
			if (my_commands->argc < 4)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = 4; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = my_commands->argc < 5 ? 3 : 5; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "sleep") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			/*
			 * Split argument into number and unit to allow "sleep 1ms" etc.
			 * We don't have to terminate the number argument with null
			 * because it will be parsed with atoi, which ignores trailing
			 * non-digit characters.
			 */
			if (my_commands->argv[1][0] != ':')
			{
				char	   *c = my_commands->argv[1];

				while (isdigit((unsigned char) *c))
					c++;
				if (*c)
				{
					my_commands->argv[2] = c;
					if (my_commands->argc < 3)
						my_commands->argc = 3;
				}
			}

			if (my_commands->argc >= 3)
			{
				if (pg_strcasecmp(my_commands->argv[2], "us") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "ms") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "s") != 0)
				{
					fprintf(stderr, "%s: unknown time unit '%s' - must be us, ms or s\n",
							my_commands->argv[0], my_commands->argv[2]);
					exit(1);
				}
			}

			for (j = 3; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "setshell") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else if (pg_strcasecmp(my_commands->argv[0], "shell") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing command\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else
		{
			fprintf(stderr, "Invalid command %s\n", my_commands->argv[0]);
			exit(1);
		}
	}
	else
	{
		my_commands->type = SQL_COMMAND;

		switch (querymode)
		{
			case QUERY_SIMPLE:
				my_commands->argv[0] = xstrdup(p);
				my_commands->argc++;
				break;
			case QUERY_EXTENDED:
			case QUERY_PREPARED:
				if (!parseQuery(my_commands, p))
					exit(1);
				break;
			default:
				exit(1);
		}
	}

	return my_commands;
}

static int
process_file(char *filename)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	FILE	   *fd;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	if (num_files >= MAX_FILES)
	{
		fprintf(stderr, "At most %d SQL files are allowed\n", MAX_FILES);
		exit(1);
	}

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	if (strcmp(filename, "-") == 0)
		fd = stdin;
	else if ((fd = fopen(filename, "r")) == NULL)
	{
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		return false;
	}

	lineno = 0;

	while (fgets(buf, sizeof(buf), fd) != NULL)
	{
		Command    *command;

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}
	fclose(fd);

	my_commands[lineno] = NULL;

	sql_files[num_files++] = my_commands;

	return true;
}

static Command **
process_builtin(char *tb)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	lineno = 0;

	for (;;)
	{
		char	   *p;
		Command    *command;

		p = buf;
		while (*tb && *tb != '\n')
			*p++ = *tb++;

		if (*tb == '\0')
			break;

		if (*tb == '\n')
			tb++;

		*p = '\0';

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}

	my_commands[lineno] = NULL;

	return my_commands;
}

/* print out results */
static void
printResults(int ttype, int normal_xacts, int nclients,
			 TState *threads, int nthreads,
			 instr_time total_time, instr_time conn_total_time)
{
	double		time_include,
				tps_include,
				tps_exclude;
	char	   *s;

	time_include = INSTR_TIME_GET_DOUBLE(total_time);
	tps_include = normal_xacts / time_include;
	tps_exclude = normal_xacts / (time_include -
						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));

	if (ttype == 0)
		s = "TPC-B (sort of)";
	else if (ttype == 2)
		s = "Update only pgbench_accounts";
	else if (ttype == 1)
		s = "SELECT only";
	else
		s = "Custom query";

	printf("transaction type: %s\n", s);
	printf("scaling factor: %d\n", scale);
	printf("query mode: %s\n", QUERYMODE[querymode]);
	printf("number of clients: %d\n", nclients);
	printf("number of threads: %d\n", nthreads);
	if (duration <= 0)
	{
		printf("number of transactions per client: %d\n", nxacts);
		printf("number of transactions actually processed: %d/%d\n",
			   normal_xacts, nxacts * nclients);
	}
	else
	{
		printf("duration: %d s\n", duration);
		printf("number of transactions actually processed: %d\n",
			   normal_xacts);
	}
	printf("tps = %f (including connections establishing)\n", tps_include);
	printf("tps = %f (excluding connections establishing)\n", tps_exclude);

	/* Report per-command latencies */
	if (is_latencies)
	{
		int			i;

		for (i = 0; i < num_files; i++)
		{
			Command   **commands;

			if (num_files > 1)
				printf("statement latencies in milliseconds, file %d:\n", i + 1);
			else
				printf("statement latencies in milliseconds:\n");

			for (commands = sql_files[i]; *commands != NULL; commands++)
			{
				Command    *command = *commands;
				int			cnum = command->command_num;
				double		total_time;
				instr_time	total_exec_elapsed;
				int			total_exec_count;
				int			t;

				/* Accumulate per-thread data for command */
				INSTR_TIME_SET_ZERO(total_exec_elapsed);
				total_exec_count = 0;
				for (t = 0; t < nthreads; t++)
				{
					TState	   *thread = &threads[t];

					INSTR_TIME_ADD(total_exec_elapsed,
								   thread->exec_elapsed[cnum]);
					total_exec_count += thread->exec_count[cnum];
				}

				if (total_exec_count > 0)
					total_time = INSTR_TIME_GET_MILLISEC(total_exec_elapsed) / (double) total_exec_count;
				else
					total_time = 0.0;

				printf("\t%f\t%s\n", total_time, command->line);
			}
		}
	}
}


int
main(int argc, char **argv)
{
	int			c;
	int			nclients = 1;	/* default number of simulated clients */
	int			nthreads = 1;	/* default number of threads */
	int			is_init_mode = 0;		/* initialize mode? */
	int			is_no_vacuum = 0;		/* no vacuum at all before testing? */
	int			do_vacuum_accounts = 0; /* do vacuum accounts before testing? */
	int			ttype = 0;		/* transaction type. 0: TPC-B, 1: SELECT only,
								 * 2: skip update of branches and tellers */
	int			optindex;
	char	   *filename = NULL;
	bool		scale_given = false;

	CState	   *state;			/* status of clients */
	TState	   *threads;		/* array of threads */

	instr_time	start_time;		/* start up time */
	instr_time	total_time;
	instr_time	conn_total_time;
	int			total_xacts;

	int			i;

	static struct option long_options[] = {
		{"foreign-keys", no_argument, &foreign_keys, 1},
		{"index-tablespace", required_argument, NULL, 3},
		{"tablespace", required_argument, NULL, 2},
		{"unlogged-tables", no_argument, &unlogged_tables, 1},
		{NULL, 0, NULL, 0}
	};

#ifdef HAVE_GETRLIMIT
	struct rlimit rlim;
#endif

	PGconn	   *con;
	PGresult   *res;
	char	   *env;

	char		val[64];

	progname = get_progname(argv[0]);

	if (argc > 1)
	{
		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
		{
			usage();
			exit(0);
		}
		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
		{
			puts("pgbench (PostgreSQL) " PG_VERSION);
			exit(0);
		}
	}

#ifdef WIN32
	/* stderr is buffered on Win32. */
	setvbuf(stderr, NULL, _IONBF, 0);
#endif

	if ((env = getenv("PGHOST")) != NULL && *env != '\0')
		pghost = env;
	if ((env = getenv("PGPORT")) != NULL && *env != '\0')
		pgport = env;
	if ((env = getenv("PGUSER")) != NULL && *env != '\0')
		login = env;

	state = (CState *) xmalloc(sizeof(CState));
	memset(state, 0, sizeof(CState));

	while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
	{
		switch (c)
		{
			case 'i':
				is_init_mode++;
				break;
			case 'h':
				pghost = optarg;
				break;
			case 'n':
				is_no_vacuum++;
				break;
			case 'v':
				do_vacuum_accounts++;
				break;
			case 'p':
				pgport = optarg;
				break;
			case 'd':
				debug++;
				break;
			case 'S':
				ttype = 1;
				break;
			case 'N':
				ttype = 2;
				break;
			case 'c':
				nclients = atoi(optarg);
				if (nclients <= 0 || nclients > MAXCLIENTS)
				{
					fprintf(stderr, "invalid number of clients: %d\n", nclients);
					exit(1);
				}
#ifdef HAVE_GETRLIMIT
#ifdef RLIMIT_NOFILE			/* most platforms use RLIMIT_NOFILE */
				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
#else							/* but BSD doesn't ... */
				if (getrlimit(RLIMIT_OFILE, &rlim) == -1)
#endif   /* RLIMIT_NOFILE */
				{
					fprintf(stderr, "getrlimit failed: %s\n", strerror(errno));
					exit(1);
				}
				if (rlim.rlim_cur <= (nclients + 2))
				{
					fprintf(stderr, "You need at least %d open files but you are only allowed to use %ld.\n", nclients + 2, (long) rlim.rlim_cur);
					fprintf(stderr, "Use limit/ulimit to increase the limit before using pgbench.\n");
					exit(1);
				}
#endif   /* HAVE_GETRLIMIT */
				break;
			case 'j':			/* jobs */
				nthreads = atoi(optarg);
				if (nthreads <= 0)
				{
					fprintf(stderr, "invalid number of threads: %d\n", nthreads);
					exit(1);
				}
				break;
			case 'C':
				is_connect = true;
				break;
			case 'r':
				is_latencies = true;
				break;
			case 's':
				scale_given = true;
				scale = atoi(optarg);
				if (scale <= 0)
				{
					fprintf(stderr, "invalid scaling factor: %d\n", scale);
					exit(1);
				}
				break;
			case 't':
				if (duration > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				nxacts = atoi(optarg);
				if (nxacts <= 0)
				{
					fprintf(stderr, "invalid number of transactions: %d\n", nxacts);
					exit(1);
				}
				break;
			case 'T':
				if (nxacts > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				duration = atoi(optarg);
				if (duration <= 0)
				{
					fprintf(stderr, "invalid duration: %d\n", duration);
					exit(1);
				}
				break;
			case 'U':
				login = optarg;
				break;
			case 'l':
				use_log = true;
				break;
			case 'f':
				ttype = 3;
				filename = optarg;
				if (process_file(filename) == false || *sql_files[num_files - 1] == NULL)
					exit(1);
				break;
			case 'D':
				{
					char	   *p;

					if ((p = strchr(optarg, '=')) == NULL || p == optarg || *(p + 1) == '\0')
					{
						fprintf(stderr, "invalid variable definition: %s\n", optarg);
						exit(1);
					}

					*p++ = '\0';
					if (!putVariable(&state[0], "option", optarg, p))
						exit(1);
				}
				break;
			case 'F':
				fillfactor = atoi(optarg);
				if ((fillfactor < 10) || (fillfactor > 100))
				{
					fprintf(stderr, "invalid fillfactor: %d\n", fillfactor);
					exit(1);
				}
				break;
			case 'M':
				if (num_files > 0)
				{
					fprintf(stderr, "query mode (-M) should be specified before transaction scripts (-f)\n");
					exit(1);
				}
				for (querymode = 0; querymode < NUM_QUERYMODE; querymode++)
					if (strcmp(optarg, QUERYMODE[querymode]) == 0)
						break;
				if (querymode >= NUM_QUERYMODE)
				{
					fprintf(stderr, "invalid query mode (-M): %s\n", optarg);
					exit(1);
				}
				break;
			case 0:
				/* This covers long options which take no argument. */
				break;
			case 2:				/* tablespace */
				tablespace = optarg;
				break;
			case 3:				/* index-tablespace */
				index_tablespace = optarg;
				break;
			default:
				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
				exit(1);
				break;
		}
	}

	if (argc > optind)
		dbName = argv[optind];
	else
	{
		if ((env = getenv("PGDATABASE")) != NULL && *env != '\0')
			dbName = env;
		else if (login != NULL && *login != '\0')
			dbName = login;
		else
			dbName = "";
	}

	if (is_init_mode)
	{
		init(is_no_vacuum);
		exit(0);
	}

	/* Use DEFAULT_NXACTS if neither nxacts nor duration is specified. */
	if (nxacts <= 0 && duration <= 0)
		nxacts = DEFAULT_NXACTS;

	if (nclients % nthreads != 0)
	{
		fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n", nclients, nthreads);
		exit(1);
	}

	/*
	 * is_latencies only works with multiple threads in thread-based
	 * implementations, not fork-based ones, because it supposes that the
	 * parent can see changes made to the per-thread execution stats by child
	 * threads.  It seems useful enough to accept despite this limitation, but
	 * perhaps we should FIXME someday (by passing the stats data back up
	 * through the parent-to-child pipes).
	 */
#ifndef ENABLE_THREAD_SAFETY
	if (is_latencies && nthreads > 1)
	{
		fprintf(stderr, "-r does not work with -j larger than 1 on this platform.\n");
		exit(1);
	}
#endif

	/*
	 * save main process id in the global variable because process id will be
	 * changed after fork.
	 */
	main_pid = (int) getpid();

	if (nclients > 1)
	{
		state = (CState *) xrealloc(state, sizeof(CState) * nclients);
		memset(state + 1, 0, sizeof(CState) * (nclients - 1));

		/* copy any -D switch values to all clients */
		for (i = 1; i < nclients; i++)
		{
			int			j;

			state[i].id = i;
			for (j = 0; j < state[0].nvariables; j++)
			{
				if (!putVariable(&state[i], "startup", state[0].variables[j].name, state[0].variables[j].value))
					exit(1);
			}
		}
	}

	if (debug)
	{
		if (duration <= 0)
			printf("pghost: %s pgport: %s nclients: %d nxacts: %d dbName: %s\n",
				   pghost, pgport, nclients, nxacts, dbName);
		else
			printf("pghost: %s pgport: %s nclients: %d duration: %d dbName: %s\n",
				   pghost, pgport, nclients, duration, dbName);
	}

	/* opening connection... */
	con = doConnect();
	if (con == NULL)
		exit(1);

	if (PQstatus(con) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database '%s' failed.\n", dbName);
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}

	if (ttype != 3)
	{
		/*
		 * get the scaling factor that should be same as count(*) from
		 * pgbench_branches if this is not a custom query
		 */
		res = PQexec(con, "select count(*) from pgbench_branches");
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(con));
			exit(1);
		}
		scale = atoi(PQgetvalue(res, 0, 0));
		if (scale < 0)
		{
			fprintf(stderr, "count(*) from pgbench_branches invalid (%d)\n", scale);
			exit(1);
		}
		PQclear(res);

		/* warn if we override user-given -s switch */
		if (scale_given)
			fprintf(stderr,
			"Scale option ignored, using pgbench_branches table count = %d\n",
					scale);
	}

	/*
	 * :scale variables normally get -s or database scale, but don't override
	 * an explicit -D switch
	 */
	if (getVariable(&state[0], "scale") == NULL)
	{
		snprintf(val, sizeof(val), "%d", scale);
		for (i = 0; i < nclients; i++)
		{
			if (!putVariable(&state[i], "startup", "scale", val))
				exit(1);
		}
	}

	if (!is_no_vacuum)
	{
		fprintf(stderr, "starting vacuum...");
		executeStatement(con, "vacuum pgbench_branches");
		executeStatement(con, "vacuum pgbench_tellers");
		executeStatement(con, "truncate pgbench_history");
		fprintf(stderr, "end.\n");

		if (do_vacuum_accounts)
		{
			fprintf(stderr, "starting vacuum pgbench_accounts...");
			executeStatement(con, "vacuum analyze pgbench_accounts");
			fprintf(stderr, "end.\n");
		}
	}
	PQfinish(con);

	/* set random seed */
	INSTR_TIME_SET_CURRENT(start_time);
	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));

	/* process builtin SQL scripts */
	switch (ttype)
	{
		case 0:
			sql_files[0] = process_builtin(tpc_b);
			num_files = 1;
			break;

		case 1:
			sql_files[0] = process_builtin(select_only);
			num_files = 1;
			break;

		case 2:
			sql_files[0] = process_builtin(simple_update);
			num_files = 1;
			break;

		default:
			break;
	}

	/* set up thread data structures */
	threads = (TState *) xmalloc(sizeof(TState) * nthreads);
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		thread->tid = i;
		thread->state = &state[nclients / nthreads * i];
		thread->nstate = nclients / nthreads;
		thread->random_state[0] = random();
		thread->random_state[1] = random();
		thread->random_state[2] = random();

		if (is_latencies)
		{
			/* Reserve memory for the thread to store per-command latencies */
			int			t;

			thread->exec_elapsed = (instr_time *)
				xmalloc(sizeof(instr_time) * num_commands);
			thread->exec_count = (int *)
				xmalloc(sizeof(int) * num_commands);

			for (t = 0; t < num_commands; t++)
			{
				INSTR_TIME_SET_ZERO(thread->exec_elapsed[t]);
				thread->exec_count[t] = 0;
			}
		}
		else
		{
			thread->exec_elapsed = NULL;
			thread->exec_count = NULL;
		}
	}

	/* get start up time */
	INSTR_TIME_SET_CURRENT(start_time);

	/* set alarm if duration is specified. */
	if (duration > 0)
		setalarm(duration);

	/* start threads */
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		INSTR_TIME_SET_CURRENT(thread->start_time);

		/* the first thread (i = 0) is executed by main thread */
		if (i > 0)
		{
			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);

			if (err != 0 || thread->thread == INVALID_THREAD)
			{
				fprintf(stderr, "cannot create thread: %s\n", strerror(err));
				exit(1);
			}
		}
		else
		{
			thread->thread = INVALID_THREAD;
		}
	}

	/* wait for threads and accumulate results */
	total_xacts = 0;
	INSTR_TIME_SET_ZERO(conn_total_time);
	for (i = 0; i < nthreads; i++)
	{
		void	   *ret = NULL;

		if (threads[i].thread == INVALID_THREAD)
			ret = threadRun(&threads[i]);
		else
			pthread_join(threads[i].thread, &ret);

		if (ret != NULL)
		{
			TResult    *r = (TResult *) ret;

			total_xacts += r->xacts;
			INSTR_TIME_ADD(conn_total_time, r->conn_time);
			free(ret);
		}
	}
	disconnect_all(state, nclients);

	/* get end time */
	INSTR_TIME_SET_CURRENT(total_time);
	INSTR_TIME_SUBTRACT(total_time, start_time);
	printResults(ttype, total_xacts, nclients, threads, nthreads,
				 total_time, conn_total_time);

	return 0;
}

static void *
threadRun(void *arg)
{
	TState	   *thread = (TState *) arg;
	CState	   *state = thread->state;
	TResult    *result;
	FILE	   *logfile = NULL; /* per-thread log file */
	instr_time	start,
				end;
	int			nstate = thread->nstate;
	int			remains = nstate;		/* number of remaining clients */
	int			i;

	result = xmalloc(sizeof(TResult));
	INSTR_TIME_SET_ZERO(result->conn_time);

	/* open log file if requested */
	if (use_log)
	{
		char		logpath[64];

		if (thread->tid == 0)
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d", main_pid);
		else
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d.%d", main_pid, thread->tid);
		logfile = fopen(logpath, "w");

		if (logfile == NULL)
		{
			fprintf(stderr, "Couldn't open logfile \"%s\": %s\n", logpath, strerror(errno));
			goto done;
		}
	}

	if (!is_connect)
	{
		/* make connections to the database */
		for (i = 0; i < nstate; i++)
		{
			if ((state[i].con = doConnect()) == NULL)
				goto done;
		}
	}

	/* time after thread and connections set up */
	INSTR_TIME_SET_CURRENT(result->conn_time);
	INSTR_TIME_SUBTRACT(result->conn_time, thread->start_time);

	/* send start up queries in async manner */
	for (i = 0; i < nstate; i++)
	{
		CState	   *st = &state[i];
		Command   **commands = sql_files[st->use_file];
		int			prev_ecnt = st->ecnt;

		st->use_file = getrand(thread, 0, num_files - 1);
		if (!doCustom(thread, st, &result->conn_time, logfile))
			remains--;			/* I've aborted */

		if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
		{
			fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
			remains--;			/* I've aborted */
			PQfinish(st->con);
			st->con = NULL;
		}
	}

	while (remains > 0)
	{
		fd_set		input_mask;
		int			maxsock;	/* max socket number to be waited */
		int64		now_usec = 0;
		int64		min_usec;

		FD_ZERO(&input_mask);

		maxsock = -1;
		min_usec = INT64_MAX;
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			sock;

			if (st->sleeping)
			{
				int64		this_usec;

				if (min_usec == INT64_MAX)
				{
					instr_time	now;

					INSTR_TIME_SET_CURRENT(now);
					now_usec = INSTR_TIME_GET_MICROSEC(now);
				}

				this_usec = st->until - now_usec;
				if (min_usec > this_usec)
					min_usec = this_usec;
			}
			else if (st->con == NULL)
			{
				continue;
			}
			else if (commands[st->state]->type == META_COMMAND)
			{
				min_usec = 0;	/* the connection is ready to run */
				break;
			}

			sock = PQsocket(st->con);
			if (sock < 0)
			{
				fprintf(stderr, "bad socket: %s\n", strerror(errno));
				goto done;
			}

			FD_SET(sock, &input_mask);

			if (maxsock < sock)
				maxsock = sock;
		}

		if (min_usec > 0 && maxsock != -1)
		{
			int			nsocks; /* return from select(2) */

			if (min_usec != INT64_MAX)
			{
				struct timeval timeout;

				timeout.tv_sec = min_usec / 1000000;
				timeout.tv_usec = min_usec % 1000000;
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout);
			}
			else
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
			if (nsocks < 0)
			{
				if (errno == EINTR)
					continue;
				/* must be something wrong */
				fprintf(stderr, "select failed: %s\n", strerror(errno));
				goto done;
			}
		}

		/* ok, backend returns reply */
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			prev_ecnt = st->ecnt;

			if (st->con && (FD_ISSET(PQsocket(st->con), &input_mask)
							|| commands[st->state]->type == META_COMMAND))
			{
				if (!doCustom(thread, st, &result->conn_time, logfile))
					remains--;	/* I've aborted */
			}

			if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
			{
				fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
				remains--;		/* I've aborted */
				PQfinish(st->con);
				st->con = NULL;
			}
		}
	}

done:
	INSTR_TIME_SET_CURRENT(start);
	disconnect_all(state, nstate);
	result->xacts = 0;
	for (i = 0; i < nstate; i++)
		result->xacts += state[i].cnt;
	INSTR_TIME_SET_CURRENT(end);
	INSTR_TIME_ACCUM_DIFF(result->conn_time, end, start);
	if (logfile)
		fclose(logfile);
	return result;
}


/*
 * Support for duration option: set timer_exceeded after so many seconds.
 */

#ifndef WIN32

static void
handle_sig_alarm(SIGNAL_ARGS)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	pqsignal(SIGALRM, handle_sig_alarm);
	alarm(seconds);
}

#ifndef ENABLE_THREAD_SAFETY

/*
 * implements pthread using fork.
 */

typedef struct fork_pthread
{
	pid_t		pid;
	int			pipes[2];
}	fork_pthread;

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	fork_pthread *th;
	void	   *ret;

	th = (fork_pthread *) xmalloc(sizeof(fork_pthread));
	if (pipe(th->pipes) < 0)
	{
		free(th);
		return errno;
	}

	th->pid = fork();
	if (th->pid == -1)			/* error */
	{
		free(th);
		return errno;
	}
	if (th->pid != 0)			/* in parent process */
	{
		close(th->pipes[1]);
		*thread = th;
		return 0;
	}

	/* in child process */
	close(th->pipes[0]);

	/* set alarm again because the child does not inherit timers */
	if (duration > 0)
		setalarm(duration);

	ret = start_routine(arg);
	write(th->pipes[1], ret, sizeof(TResult));
	close(th->pipes[1]);
	free(th);
	exit(0);
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	int			status;

	while (waitpid(th->pid, &status, 0) != th->pid)
	{
		if (errno != EINTR)
			return errno;
	}

	if (thread_return != NULL)
	{
		/* assume result is TResult */
		*thread_return = xmalloc(sizeof(TResult));
		if (read(th->pipes[0], *thread_return, sizeof(TResult)) != sizeof(TResult))
		{
			free(*thread_return);
			*thread_return = NULL;
		}
	}
	close(th->pipes[0]);

	free(th);
	return 0;
}
#endif
#else							/* WIN32 */

static VOID CALLBACK
win32_timer_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	HANDLE		queue;
	HANDLE		timer;

	/* This function will be called at most once, so we can cheat a bit. */
	queue = CreateTimerQueue();
	if (seconds > ((DWORD) -1) / 1000 ||
		!CreateTimerQueueTimer(&timer, queue,
							   win32_timer_callback, NULL, seconds * 1000, 0,
							   WT_EXECUTEINTIMERTHREAD | WT_EXECUTEONLYONCE))
	{
		fprintf(stderr, "Failed to set timer\n");
		exit(1);
	}
}

/* partial pthread implementation for Windows */

typedef struct win32_pthread
{
	HANDLE		handle;
	void	   *(*routine) (void *);
	void	   *arg;
	void	   *result;
} win32_pthread;

static unsigned __stdcall
win32_pthread_run(void *arg)
{
	win32_pthread *th = (win32_pthread *) arg;

	th->result = th->routine(th->arg);

	return 0;
}

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	int			save_errno;
	win32_pthread *th;

	th = (win32_pthread *) xmalloc(sizeof(win32_pthread));
	th->routine = start_routine;
	th->arg = arg;
	th->result = NULL;

	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
	if (th->handle == NULL)
	{
		save_errno = errno;
		free(th);
		return save_errno;
	}

	*thread = th;
	return 0;
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	if (th == NULL || th->handle == NULL)
		return errno = EINVAL;

	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
	{
		_dosmaperr(GetLastError());
		return errno;
	}

	if (thread_return)
		*thread_return = th->result;

	CloseHandle(th->handle);
	free(th);
	return 0;
}

#endif   /* WIN32 */
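
A note on the main() routine in the attachment above: it rejects any run where the number of clients is not a multiple of the number of threads, and then gives each thread a contiguous slice of the state array (thread->state = &state[nclients / nthreads * i], thread->nstate = nclients / nthreads). The following is only a minimal sketch of that arithmetic, not code from the patch; clients_per_thread and first_client are hypothetical helper names that do not exist in pgbench.

```c
#include <assert.h>

/*
 * Hypothetical helpers (not part of pgbench.c).  They mirror the
 * expressions "nclients / nthreads" and "nclients / nthreads * i"
 * that main() uses when it fills in each TState.
 */

/* Number of clients each thread drives; caller must ensure divisibility. */
static int
clients_per_thread(int nclients, int nthreads)
{
	assert(nthreads > 0 && nclients % nthreads == 0);
	return nclients / nthreads;
}

/* Index into state[] of the first client owned by thread tid. */
static int
first_client(int nclients, int nthreads, int tid)
{
	assert(tid >= 0 && tid < nthreads);
	return clients_per_thread(nclients, nthreads) * tid;
}
```

For example, with 8 clients and 4 threads each thread drives 2 clients, and thread 2 starts at state[4]. Because the slices never overlap, each thread can update its own clients' counters without locking.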

Attachment: pgbench_250.c (application/octet-stream)

/*
 * pgbench.c
 *
 * A simple benchmark program for PostgreSQL
 * Originally written by Tatsuo Ishii and enhanced by many contributors.
 *
 * contrib/pgbench/pgbench.c
 * Copyright (c) 2000-2012, PostgreSQL Global Development Group
 * ALL RIGHTS RESERVED;
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose, without fee, and without a written agreement
 * is hereby granted, provided that the above copyright notice and this
 * paragraph and the following two paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
 * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIMS ANY WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
 * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
 * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 *
 */

#ifdef WIN32
#define FD_SETSIZE 1024			/* set before winsock2.h is included */
#endif   /* WIN32 */

#include "postgres_fe.h"

#include "getopt_long.h"
#include "libpq-fe.h"
#include "libpq/pqsignal.h"
#include "portability/instr_time.h"

#include <ctype.h>

#ifndef WIN32
#include <sys/time.h>
#include <unistd.h>
#endif   /* ! WIN32 */

#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif

#ifdef HAVE_SYS_RESOURCE_H
#include <sys/resource.h>		/* for getrlimit */
#endif

#ifndef INT64_MAX
#define INT64_MAX	INT64CONST(0x7FFFFFFFFFFFFFFF)
#endif

/*
 * Multi-platform pthread implementations
 */

#ifdef WIN32
/* Use native win32 threads on Windows */
typedef struct win32_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#elif defined(ENABLE_THREAD_SAFETY)
/* Use platform-dependent pthread capability */
#include <pthread.h>
#else
/* Use emulation with fork. Rename pthread identifiers to avoid conflicts */

#include <sys/wait.h>

#define pthread_t				pg_pthread_t
#define pthread_attr_t			pg_pthread_attr_t
#define pthread_create			pg_pthread_create
#define pthread_join			pg_pthread_join

typedef struct fork_pthread *pthread_t;
typedef int pthread_attr_t;

static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int	pthread_join(pthread_t th, void **thread_return);
#endif

extern char *optarg;
extern int	optind;


/********************************************************************
 * some configurable parameters */

/* max number of clients allowed */
#ifdef FD_SETSIZE
#define MAXCLIENTS	(FD_SETSIZE - 10)
#else
#define MAXCLIENTS	1024
#endif

#define DEFAULT_NXACTS	10		/* default nxacts */

int			nxacts = 0;			/* number of transactions per client */
int			duration = 0;		/* duration in seconds */

/*
 * scaling factor. for example, scale = 10 will make 1000000 tuples in
 * pgbench_accounts table.
 */
int			scale = 1;

/*
 * fillfactor. for example, fillfactor = 90 will use only 90 percent
 * space during inserts and leave 10 percent free.
 */
int			fillfactor = 100;

/*
 * create foreign key constraints on the tables?
 */
int			foreign_keys = 0;

/*
 * use unlogged tables?
 */
int			unlogged_tables = 0;

/*
 * tablespace selection
 */
char	   *tablespace = NULL;
char	   *index_tablespace = NULL;

/*
 * end of configurable parameters
 *********************************************************************/

#define nbranches	1			/* Makes little sense to change this.  Change
								 * -s instead */
#define ntellers	10
#define naccounts	100000

bool		use_log;			/* log transaction latencies to a file */
bool		is_connect;			/* establish connection for each transaction */
bool		is_latencies;		/* report per-command latencies */
int			main_pid;			/* main process id used in log filename */

char	   *pghost = "";
char	   *pgport = "";
char	   *login = NULL;
char	   *dbName;
const char *progname;

volatile bool timer_exceeded = false;	/* flag from signal handler */

/* variable definitions */
typedef struct
{
	char	   *name;			/* variable name */
	char	   *value;			/* its value */
} Variable;

#define MAX_FILES		128		/* max number of SQL script files allowed */
#define SHELL_COMMAND_SIZE	256 /* maximum size allowed for shell command */

/*
 * structures used in custom query mode
 */

typedef struct
{
	PGconn	   *con;			/* connection handle to DB */
	int			id;				/* client No. */
	int			state;			/* state No. */
	int			cnt;			/* xacts count */
	int			ecnt;			/* error count */
	int			listen;			/* 0 indicates that an async query has been
								 * sent */
	int			sleeping;		/* 1 indicates that the client is napping */
	int64		until;			/* napping until (usec) */
	Variable   *variables;		/* array of variable definitions */
	int			nvariables;
	instr_time	txn_begin;		/* used for measuring transaction latencies */
	instr_time	stmt_begin;		/* used for measuring statement latencies */
	int			use_file;		/* index in sql_files for this client */
	bool		prepared[MAX_FILES];
} CState;

/*
 * Thread state and result
 */
typedef struct
{
	int			tid;			/* thread id */
	pthread_t	thread;			/* thread handle */
	CState	   *state;			/* array of CState */
	int			nstate;			/* length of state[] */
	instr_time	start_time;		/* thread start time */
	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
	int		   *exec_count;		/* number of cmd executions (per Command) */
	unsigned short random_state[3];		/* separate randomness for each thread */
} TState;

#define INVALID_THREAD		((pthread_t) 0)

typedef struct
{
	instr_time	conn_time;
	int			xacts;
} TResult;

/*
 * queries read from files
 */
#define SQL_COMMAND		1
#define META_COMMAND	2
#define MAX_ARGS		10

typedef enum QueryMode
{
	QUERY_SIMPLE,				/* simple query */
	QUERY_EXTENDED,				/* extended query */
	QUERY_PREPARED,				/* extended query with prepared statements */
	NUM_QUERYMODE
} QueryMode;

static QueryMode querymode = QUERY_SIMPLE;
static const char *QUERYMODE[] = {"simple", "extended", "prepared"};

typedef struct
{
	char	   *line;			/* full text of command line */
	int			command_num;	/* unique index of this Command struct */
	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
	int			argc;			/* number of command words */
	char	   *argv[MAX_ARGS]; /* command word list */
} Command;

static Command **sql_files[MAX_FILES];	/* SQL script files */
static int	num_files;			/* number of script files */
static int	num_commands = 0;	/* total number of Command structs */
static int	debug = 0;			/* debug flag */

/* default scenario */
static char *tpc_b = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta,"
	"filler = \'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ :delta\',"
	"filler1 = random_text_md5_v2(100) WHERE aid = :aid;\n"
	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta,"
	"filler = \'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ :delta\',"
	"filler1 = random_text_md5_v2(100) WHERE tid = :tid;\n"
	"UPDATE pgbench_branches SET bbalance = bbalance + :delta,"
	"filler = \'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ :delta\',"
	"filler1 = random_text_md5_v2(100) WHERE bid = :bid;\n"
	"END;\n"
};

/* -N case */
static char *simple_update = {
	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"\\setrandom bid 1 :nbranches\n"
	"\\setrandom tid 1 :ntellers\n"
	"\\setrandom delta -5000 5000\n"
	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
	"END;\n"
};

/* -S case */
static char *select_only = {
	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
	"\\setrandom aid 1 :naccounts\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
};

/* Function prototypes */
static void setalarm(int seconds);
static void *threadRun(void *arg);


/*
 * routines to check mem allocations and fail noisily.
 */
static void *
xmalloc(size_t size)
{
	void	   *result;

	result = malloc(size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static void *
xrealloc(void *ptr, size_t size)
{
	void	   *result;

	result = realloc(ptr, size);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}

static char *
xstrdup(const char *s)
{
	char	   *result;

	result = strdup(s);
	if (!result)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return result;
}


static void
usage(void)
{
	printf("%s is a benchmarking tool for PostgreSQL.\n\n"
		   "Usage:\n"
		   "  %s [OPTION]... [DBNAME]\n"
		   "\nInitialization options:\n"
		   "  -i           invokes initialization mode\n"
		   "  -n           do not run VACUUM after initialization\n"
		   "  -F NUM       fill factor\n"
		   "  -s NUM       scaling factor\n"
		   "  --foreign-keys\n"
		   "               create foreign key constraints between tables\n"
		   "  --index-tablespace=TABLESPACE\n"
		   "               create indexes in the specified tablespace\n"
		   "  --tablespace=TABLESPACE\n"
		   "               create tables in the specified tablespace\n"
		   "  --unlogged-tables\n"
		   "               create tables as unlogged tables\n"
		   "\nBenchmarking options:\n"
		"  -c NUM       number of concurrent database clients (default: 1)\n"
		   "  -C           establish new connection for each transaction\n"
		   "  -D VARNAME=VALUE\n"
		   "               define variable for use by custom script\n"
		   "  -f FILENAME  read transaction script from FILENAME\n"
		   "  -j NUM       number of threads (default: 1)\n"
		   "  -l           write transaction times to log file\n"
		   "  -M simple|extended|prepared\n"
		   "               protocol for submitting queries to server (default: simple)\n"
		   "  -n           do not run VACUUM before tests\n"
		   "  -N           do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
		   "  -r           report average latency per command\n"
		   "  -s NUM       report this scale factor in output\n"
		   "  -S           perform SELECT-only transactions\n"
	 "  -t NUM       number of transactions each client runs (default: 10)\n"
		   "  -T NUM       duration of benchmark test in seconds\n"
		   "  -v           vacuum all four standard tables before tests\n"
		   "\nCommon options:\n"
		   "  -d             print debugging output\n"
		   "  -h HOSTNAME    database server host or socket directory\n"
		   "  -p PORT        database server port number\n"
		   "  -U USERNAME    connect as specified database user\n"
		   "  -V, --version  output version information, then exit\n"
		   "  -?, --help     show this help, then exit\n"
		   "\n"
		   "Report bugs to <pgsql-bugs@postgresql.org>.\n",
		   progname, progname);
}

/* random number generator: uniform distribution from min to max inclusive */
static int
getrand(TState *thread, int min, int max)
{
	/*
	 * Odd coding is so that min and max have approximately the same chance of
	 * being selected as do numbers between them.
	 *
	 * pg_erand48() is thread-safe and concurrent, which is why we use it
	 * rather than random(), which in glibc is non-reentrant, and therefore
	 * protected by a mutex, and therefore a bottleneck on machines with many
	 * CPUs.
	 */
	return min + (int) ((max - min + 1) * pg_erand48(thread->random_state));
}

/* call PQexec() and exit() on failure */
static void
executeStatement(PGconn *con, const char *sql)
{
	PGresult   *res;

	res = PQexec(con, sql);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);
}

/* set up a connection to the backend */
static PGconn *
doConnect(void)
{
	PGconn	   *conn;
	static char *password = NULL;
	bool		new_pass;

	/*
	 * Start the connection.  Loop until we have a password if requested by
	 * backend.
	 */
	do
	{
#define PARAMS_ARRAY_SIZE	7

		const char *keywords[PARAMS_ARRAY_SIZE];
		const char *values[PARAMS_ARRAY_SIZE];

		keywords[0] = "host";
		values[0] = pghost;
		keywords[1] = "port";
		values[1] = pgport;
		keywords[2] = "user";
		values[2] = login;
		keywords[3] = "password";
		values[3] = password;
		keywords[4] = "dbname";
		values[4] = dbName;
		keywords[5] = "fallback_application_name";
		values[5] = progname;
		keywords[6] = NULL;
		values[6] = NULL;

		new_pass = false;

		conn = PQconnectdbParams(keywords, values, true);

		if (!conn)
		{
			fprintf(stderr, "Connection to database \"%s\" failed\n",
					dbName);
			return NULL;
		}

		if (PQstatus(conn) == CONNECTION_BAD &&
			PQconnectionNeedsPassword(conn) &&
			password == NULL)
		{
			PQfinish(conn);
			password = simple_prompt("Password: ", 100, false);
			new_pass = true;
		}
	} while (new_pass);

	/* check to see that the backend connection was successfully made */
	if (PQstatus(conn) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database \"%s\" failed:\n%s",
				dbName, PQerrorMessage(conn));
		PQfinish(conn);
		return NULL;
	}

	return conn;
}

/* throw away response from backend */
static void
discard_response(CState *state)
{
	PGresult   *res;

	do
	{
		res = PQgetResult(state->con);
		if (res)
			PQclear(res);
	} while (res);
}

static int
compareVariables(const void *v1, const void *v2)
{
	return strcmp(((const Variable *) v1)->name,
				  ((const Variable *) v2)->name);
}

static char *
getVariable(CState *st, char *name)
{
	Variable	key,
			   *var;

	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables <= 0)
		return NULL;

	key.name = name;
	var = (Variable *) bsearch((void *) &key,
							   (void *) st->variables,
							   st->nvariables,
							   sizeof(Variable),
							   compareVariables);
	if (var != NULL)
		return var->value;
	else
		return NULL;
}

/* check whether the name consists only of letters, digits and underscores. */
static bool
isLegalVariableName(const char *name)
{
	int			i;

	for (i = 0; name[i] != '\0'; i++)
	{
		if (!isalnum((unsigned char) name[i]) && name[i] != '_')
			return false;
	}

	return true;
}

static int
putVariable(CState *st, const char *context, char *name, char *value)
{
	Variable	key,
			   *var;

	key.name = name;
	/* On some versions of Solaris, bsearch of zero items dumps core */
	if (st->nvariables > 0)
		var = (Variable *) bsearch((void *) &key,
								   (void *) st->variables,
								   st->nvariables,
								   sizeof(Variable),
								   compareVariables);
	else
		var = NULL;

	if (var == NULL)
	{
		Variable   *newvars;

		/*
		 * Check for the name only when declaring a new variable to avoid
		 * overhead.
		 */
		if (!isLegalVariableName(name))
		{
			fprintf(stderr, "%s: invalid variable name '%s'\n", context, name);
			return false;
		}

		if (st->variables)
			newvars = (Variable *) xrealloc(st->variables,
									(st->nvariables + 1) * sizeof(Variable));
		else
			newvars = (Variable *) xmalloc(sizeof(Variable));

		st->variables = newvars;

		var = &newvars[st->nvariables];

		var->name = xstrdup(name);
		var->value = xstrdup(value);

		st->nvariables++;

		qsort((void *) st->variables, st->nvariables, sizeof(Variable),
			  compareVariables);
	}
	else
	{
		char	   *val;

		/* dup then free, in case value is pointing at this variable */
		val = xstrdup(value);

		free(var->value);
		var->value = val;
	}

	return true;
}

static char *
parseVariable(const char *sql, int *eaten)
{
	int			i = 0;
	char	   *name;

	do
	{
		i++;
	} while (isalnum((unsigned char) sql[i]) || sql[i] == '_');
	if (i == 1)
		return NULL;

	name = xmalloc(i);
	memcpy(name, &sql[1], i - 1);
	name[i - 1] = '\0';

	*eaten = i;
	return name;
}

static char *
replaceVariable(char **sql, char *param, int len, char *value)
{
	int			valueln = strlen(value);

	if (valueln > len)
	{
		size_t		offset = param - *sql;

		*sql = xrealloc(*sql, strlen(*sql) - len + valueln + 1);
		param = *sql + offset;
	}

	if (valueln != len)
		memmove(param + valueln, param + len, strlen(param + len) + 1);
	strncpy(param, value, valueln);

	return param + valueln;
}

static char *
assignVariables(CState *st, char *sql)
{
	char	   *p,
			   *name,
			   *val;

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		val = getVariable(st, name);
		free(name);
		if (val == NULL)
		{
			p++;
			continue;
		}

		p = replaceVariable(&sql, p, eaten, val);
	}

	return sql;
}

static void
getQueryParams(CState *st, const Command *command, const char **params)
{
	int			i;

	for (i = 0; i < command->argc - 1; i++)
		params[i] = getVariable(st, command->argv[i + 1]);
}

/*
 * Run a shell command. The result is assigned to the variable if not NULL.
 * Return true if succeeded, or false on error.
 */
static bool
runShellCommand(CState *st, char *variable, char **argv, int argc)
{
	char		command[SHELL_COMMAND_SIZE];
	int			i,
				len = 0;
	FILE	   *fp;
	char		res[64];
	char	   *endptr;
	int			retval;

	/*----------
	 * Join arguments with whitespace separators. Arguments starting with
	 * exactly one colon are treated as variables:
	 *	name - append a string "name"
	 *	:var - append a variable named 'var'
	 *	::name - append a string ":name"
	 *----------
	 */
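	/*
	 * Illustration only (a hypothetical script line, not part of the patch):
	 * assuming the variable "aid" currently holds 7, the meta command
	 *	\shell echo :aid ::tag done
	 * reaches this function as argv = {"echo", ":aid", "::tag", "done"},
	 * and the loop below joins it into the command string
	 *	echo 7 :tag done
	 */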
	for (i = 0; i < argc; i++)
	{
		char	   *arg;
		int			arglen;

		if (argv[i][0] != ':')
		{
			arg = argv[i];		/* a string literal */
		}
		else if (argv[i][1] == ':')
		{
			arg = argv[i] + 1;	/* a string literal starting with colons */
		}
		else if ((arg = getVariable(st, argv[i] + 1)) == NULL)
		{
			fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[i]);
			return false;
		}

		arglen = strlen(arg);
		if (len + arglen + (i > 0 ? 1 : 0) >= SHELL_COMMAND_SIZE - 1)
		{
			fprintf(stderr, "%s: shell command is too long\n", argv[0]);
			return false;
		}

		if (i > 0)
			command[len++] = ' ';
		memcpy(command + len, arg, arglen);
		len += arglen;
	}

	command[len] = '\0';

	/* Fast path for non-assignment case */
	if (variable == NULL)
	{
		if (system(command))
		{
			if (!timer_exceeded)
				fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
			return false;
		}
		return true;
	}

	/* Execute the command with pipe and read the standard output. */
	if ((fp = popen(command, "r")) == NULL)
	{
		fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
		return false;
	}
	if (fgets(res, sizeof(res), fp) == NULL)
	{
		if (!timer_exceeded)
			fprintf(stderr, "%s: cannot read the result\n", argv[0]);
		(void) pclose(fp);		/* don't leak the pipe on the error path */
		return false;
	}
	if (pclose(fp) < 0)
	{
		fprintf(stderr, "%s: cannot close shell command\n", argv[0]);
		return false;
	}

	/* Check whether the result is an integer and assign it to the variable */
	retval = (int) strtol(res, &endptr, 10);
	while (*endptr != '\0' && isspace((unsigned char) *endptr))
		endptr++;
	if (*res == '\0' || *endptr != '\0')
	{
		fprintf(stderr, "%s: must return an integer ('%s' returned)\n", argv[0], res);
		return false;
	}
	snprintf(res, sizeof(res), "%d", retval);
	if (!putVariable(st, "setshell", variable, res))
		return false;

#ifdef DEBUG
	printf("shell parameter name: %s, value: %s\n", variable, res);
#endif
	return true;
}

#define MAX_PREPARE_NAME		32
static void
preparedStatementName(char *buffer, int file, int state)
{
	sprintf(buffer, "P%d_%d", file, state);
}

static bool
clientDone(CState *st, bool ok)
{
	(void) ok;					/* unused */

	if (st->con != NULL)
	{
		PQfinish(st->con);
		st->con = NULL;
	}
	return false;				/* always false */
}

/* return false iff client should be disconnected */
static bool
doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile)
{
	PGresult   *res;
	Command   **commands;

top:
	commands = sql_files[st->use_file];

	if (st->sleeping)
	{							/* are we sleeping? */
		instr_time	now;

		INSTR_TIME_SET_CURRENT(now);
		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
		else
			return true;		/* Still sleeping, nothing to do here */
	}

	if (st->listen)
	{							/* are we receiver? */
		if (commands[st->state]->type == SQL_COMMAND)
		{
			if (debug)
				fprintf(stderr, "client %d receiving\n", st->id);
			if (!PQconsumeInput(st->con))
			{					/* there's something wrong */
				fprintf(stderr, "Client %d aborted in state %d. Probably the backend died while processing.\n", st->id, st->state);
				return clientDone(st, false);
			}
			if (PQisBusy(st->con))
				return true;	/* don't have the whole result yet */
		}

		/*
		 * command finished: accumulate per-command execution times in
		 * thread-local data structure, if per-command latencies are requested
		 */
		if (is_latencies)
		{
			instr_time	now;
			int			cnum = commands[st->state]->command_num;

			INSTR_TIME_SET_CURRENT(now);
			INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
								  now, st->stmt_begin);
			thread->exec_count[cnum]++;
		}

		/*
		 * if transaction finished, record the time it took in the log
		 */
		if (logfile && commands[st->state + 1] == NULL)
		{
			instr_time	now;
			instr_time	diff;
			double		usec;

			INSTR_TIME_SET_CURRENT(now);
			diff = now;
			INSTR_TIME_SUBTRACT(diff, st->txn_begin);
			usec = (double) INSTR_TIME_GET_MICROSEC(diff);

#ifndef WIN32
			/* This is more than we really ought to know about instr_time */
			fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
					st->id, st->cnt, usec, st->use_file,
					(long) now.tv_sec, (long) now.tv_usec);
#else
			/* On Windows, instr_time doesn't provide a timestamp anyway */
			fprintf(logfile, "%d %d %.0f %d 0 0\n",
					st->id, st->cnt, usec, st->use_file);
#endif
		}

		if (commands[st->state]->type == SQL_COMMAND)
		{
			/*
			 * Read and discard the query result; note this is not included in
			 * the statement latency numbers.
			 */
			res = PQgetResult(st->con);
			switch (PQresultStatus(res))
			{
				case PGRES_COMMAND_OK:
				case PGRES_TUPLES_OK:
					break;		/* OK */
				default:
					fprintf(stderr, "Client %d aborted in state %d: %s",
							st->id, st->state, PQerrorMessage(st->con));
					PQclear(res);
					return clientDone(st, false);
			}
			PQclear(res);
			discard_response(st);
		}

		if (commands[st->state + 1] == NULL)
		{
			if (is_connect)
			{
				PQfinish(st->con);
				st->con = NULL;
			}

			++st->cnt;
			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
				return clientDone(st, true);	/* exit success */
		}

		/* increment state counter */
		st->state++;
		if (commands[st->state] == NULL)
		{
			st->state = 0;
			st->use_file = getrand(thread, 0, num_files - 1);
			commands = sql_files[st->use_file];
		}
	}

	if (st->con == NULL)
	{
		instr_time	start,
					end;

		INSTR_TIME_SET_CURRENT(start);
		if ((st->con = doConnect()) == NULL)
		{
			fprintf(stderr, "Client %d aborted in establishing connection.\n", st->id);
			return clientDone(st, false);
		}
		INSTR_TIME_SET_CURRENT(end);
		INSTR_TIME_ACCUM_DIFF(*conn_time, end, start);
	}

	/* Record transaction start time if logging is enabled */
	if (logfile && st->state == 0)
		INSTR_TIME_SET_CURRENT(st->txn_begin);

	/* Record statement start time if per-command latencies are requested */
	if (is_latencies)
		INSTR_TIME_SET_CURRENT(st->stmt_begin);

	if (commands[st->state]->type == SQL_COMMAND)
	{
		const Command *command = commands[st->state];
		int			r;

		if (querymode == QUERY_SIMPLE)
		{
			char	   *sql;

			sql = xstrdup(command->argv[0]);
			sql = assignVariables(st, sql);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQuery(st->con, sql);
			free(sql);
		}
		else if (querymode == QUERY_EXTENDED)
		{
			const char *sql = command->argv[0];
			const char *params[MAX_ARGS];

			getQueryParams(st, command, params);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, sql);
			r = PQsendQueryParams(st->con, sql, command->argc - 1,
								  NULL, params, NULL, NULL, 0);
		}
		else if (querymode == QUERY_PREPARED)
		{
			char		name[MAX_PREPARE_NAME];
			const char *params[MAX_ARGS];

			if (!st->prepared[st->use_file])
			{
				int			j;

				for (j = 0; commands[j] != NULL; j++)
				{
					PGresult   *res;
					char		name[MAX_PREPARE_NAME];

					if (commands[j]->type != SQL_COMMAND)
						continue;
					preparedStatementName(name, st->use_file, j);
					res = PQprepare(st->con, name,
						  commands[j]->argv[0], commands[j]->argc - 1, NULL);
					if (PQresultStatus(res) != PGRES_COMMAND_OK)
						fprintf(stderr, "%s", PQerrorMessage(st->con));
					PQclear(res);
				}
				st->prepared[st->use_file] = true;
			}

			getQueryParams(st, command, params);
			preparedStatementName(name, st->use_file, st->state);

			if (debug)
				fprintf(stderr, "client %d sending %s\n", st->id, name);
			r = PQsendQueryPrepared(st->con, name, command->argc - 1,
									params, NULL, NULL, 0);
		}
		else	/* unknown sql mode */
			r = 0;

		if (r == 0)
		{
			if (debug)
				fprintf(stderr, "client %d cannot send %s\n", st->id, command->argv[0]);
			st->ecnt++;
		}
		else
			st->listen = 1;		/* flag that we should listen for the result */
	}
	else if (commands[st->state]->type == META_COMMAND)
	{
		int			argc = commands[st->state]->argc,
					i;
		char	  **argv = commands[st->state]->argv;

		if (debug)
		{
			fprintf(stderr, "client %d executing \\%s", st->id, argv[0]);
			for (i = 1; i < argc; i++)
				fprintf(stderr, " %s", argv[i]);
			fprintf(stderr, "\n");
		}

		if (pg_strcasecmp(argv[0], "setrandom") == 0)
		{
			char	   *var;
			int			min,
						max;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				min = atoi(var);
			}
			else
				min = atoi(argv[2]);

#ifdef NOT_USED
			if (min < 0)
			{
				fprintf(stderr, "%s: invalid minimum number %d\n", argv[0], min);
				st->ecnt++;
				return true;
			}
#endif

			if (*argv[3] == ':')
			{
				if ((var = getVariable(st, argv[3] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
				max = atoi(var);
			}
			else
				max = atoi(argv[3]);

			if (max < min)
			{
				fprintf(stderr, "%s: maximum is less than minimum\n", argv[0]);
				st->ecnt++;
				return true;
			}

			/*
			 * getrand() needs to be able to subtract min from max and add
			 * one to the result without overflowing.  Since we know max >= min,
			 * we can detect overflow just by checking for a negative result.
			 * But we must check both that the subtraction doesn't overflow,
			 * and that adding one to the result doesn't overflow either.
			 */
			if (max - min < 0 || (max - min) + 1 < 0)
			{
				fprintf(stderr, "%s: range too large\n", argv[0]);
				st->ecnt++;
				return true;
			}

#ifdef DEBUG
			printf("min: %d max: %d random: %d\n", min, max, getrand(thread, min, max));
#endif
			snprintf(res, sizeof(res), "%d", getrand(thread, min, max));

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "set") == 0)
		{
			char	   *var;
			int			ope1,
						ope2;
			char		res[64];

			if (*argv[2] == ':')
			{
				if ((var = getVariable(st, argv[2] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
					st->ecnt++;
					return true;
				}
				ope1 = atoi(var);
			}
			else
				ope1 = atoi(argv[2]);

			if (argc < 5)
				snprintf(res, sizeof(res), "%d", ope1);
			else
			{
				if (*argv[4] == ':')
				{
					if ((var = getVariable(st, argv[4] + 1)) == NULL)
					{
						fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[4]);
						st->ecnt++;
						return true;
					}
					ope2 = atoi(var);
				}
				else
					ope2 = atoi(argv[4]);

				if (strcmp(argv[3], "+") == 0)
					snprintf(res, sizeof(res), "%d", ope1 + ope2);
				else if (strcmp(argv[3], "-") == 0)
					snprintf(res, sizeof(res), "%d", ope1 - ope2);
				else if (strcmp(argv[3], "*") == 0)
					snprintf(res, sizeof(res), "%d", ope1 * ope2);
				else if (strcmp(argv[3], "/") == 0)
				{
					if (ope2 == 0)
					{
						fprintf(stderr, "%s: division by zero\n", argv[0]);
						st->ecnt++;
						return true;
					}
					snprintf(res, sizeof(res), "%d", ope1 / ope2);
				}
				else
				{
					fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
					st->ecnt++;
					return true;
				}
			}

			if (!putVariable(st, argv[0], argv[1], res))
			{
				st->ecnt++;
				return true;
			}

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "sleep") == 0)
		{
			char	   *var;
			int			usec;
			instr_time	now;

			if (*argv[1] == ':')
			{
				if ((var = getVariable(st, argv[1] + 1)) == NULL)
				{
					fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[1]);
					st->ecnt++;
					return true;
				}
				usec = atoi(var);
			}
			else
				usec = atoi(argv[1]);

			if (argc > 2)
			{
				if (pg_strcasecmp(argv[2], "ms") == 0)
					usec *= 1000;
				else if (pg_strcasecmp(argv[2], "s") == 0)
					usec *= 1000000;
			}
			else
				usec *= 1000000;

			INSTR_TIME_SET_CURRENT(now);
			st->until = INSTR_TIME_GET_MICROSEC(now) + usec;
			st->sleeping = 1;

			st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "setshell") == 0)
		{
			bool		ret = runShellCommand(st, argv[1], argv + 2, argc - 2);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		else if (pg_strcasecmp(argv[0], "shell") == 0)
		{
			bool		ret = runShellCommand(st, NULL, argv + 1, argc - 1);

			if (timer_exceeded) /* timeout */
				return clientDone(st, true);
			else if (!ret)		/* on error */
			{
				st->ecnt++;
				return true;
			}
			else	/* succeeded */
				st->listen = 1;
		}
		goto top;
	}

	return true;
}

/* discard connections */
static void
disconnect_all(CState *state, int length)
{
	int			i;

	for (i = 0; i < length; i++)
	{
		if (state[i].con)
		{
			PQfinish(state[i].con);
			state[i].con = NULL;
		}
	}
}

/* create tables and setup data */
static void
init(bool is_no_vacuum)
{
	/*
	 * Note: TPC-B requires at least 100 bytes per row, and the "filler"
	 * fields in these table declarations were intended to comply with that.
	 * But because they default to NULLs, they don't actually take any space.
	 * We could fix that by giving them non-null default values. However, that
	 * would completely break comparability of pgbench results with prior
	 * versions.  Since pgbench has never pretended to be fully TPC-B
	 * compliant anyway, we stick with the historical behavior.
	 */
	struct ddlinfo
	{
		char	   *table;
		char	   *cols;
		int			declare_fillfactor;
	};
	struct ddlinfo DDLs[] = {
		{
			"pgbench_history",
			"tid int,bid int,aid int,delta int,mtime timestamp,filler char(22)",
			0
		},
		{
			"pgbench_tellers",
			"tid int not null,bid int,tbalance int,filler char(92),"
			"tbalance1 int, filler1 varchar(152),tbalance2 int,filler2 char(1550)",
			1
		},
		{
			"pgbench_accounts",
			"aid int not null,bid int,abalance int,filler char(92),"
			"abalance1 int,filler1 varchar(152),abalance2 int,filler2 char(1550)",
			1
		},
		{
			"pgbench_branches",
			"bid int not null,bbalance int,filler char(92),bbalance1 int,"
			"filler1 varchar(152), bbalance2 int, filler2 char(1550)",
			1
		}
	};
	static char *DDLAFTERs[] = {
		"alter table pgbench_branches add primary key (bid)",
		"alter table pgbench_tellers add primary key (tid)",
		"alter table pgbench_accounts add primary key (aid)"
	};
	static char *DDLKEYs[] = {
		"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
		"alter table pgbench_accounts add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (bid) references pgbench_branches",
		"alter table pgbench_history add foreign key (tid) references pgbench_tellers",
		"alter table pgbench_history add foreign key (aid) references pgbench_accounts"
	};

	PGconn	   *con;
	PGresult   *res;
	char		sql[256];
	int			i;

	if ((con = doConnect()) == NULL)
		exit(1);

	for (i = 0; i < lengthof(DDLs); i++)
	{
		char		opts[256];
		char		buffer[256];
		struct ddlinfo *ddl = &DDLs[i];

		/* Remove old table, if it exists. */
		snprintf(buffer, 256, "drop table if exists %s", ddl->table);
		executeStatement(con, buffer);

		/* Construct new create table statement. */
		opts[0] = '\0';
		if (ddl->declare_fillfactor)
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " with (fillfactor=%d)", fillfactor);
		if (tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, tablespace,
												   strlen(tablespace));
			snprintf(opts + strlen(opts), 256 - strlen(opts),
					 " tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}
		snprintf(buffer, 256, "create%s table %s(%s)%s",
				 unlogged_tables ? " unlogged" : "",
				 ddl->table, ddl->cols, opts);

		executeStatement(con, buffer);
	}

	executeStatement(con, "begin");

	for (i = 0; i < nbranches * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_branches values(%d,0,0,0,0,0,0)", i + 1);
		executeStatement(con, sql);
	}

	for (i = 0; i < ntellers * scale; i++)
	{
		snprintf(sql, 256, "insert into pgbench_tellers values (%d,%d,0,0,0,0,0,0)",
				 i + 1, i / ntellers + 1);
		executeStatement(con, sql);
	}

	executeStatement(con, "commit");

	/*
	 * fill the pgbench_accounts table with some data
	 */
	fprintf(stderr, "creating tables...\n");

	executeStatement(con, "begin");
	executeStatement(con, "truncate pgbench_accounts");

	res = PQexec(con, "copy pgbench_accounts from stdin");
	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}
	PQclear(res);

	for (i = 0; i < naccounts * scale; i++)
	{
		int			j = i + 1;

		snprintf(sql, 256, "%d\t%d\t%d\t \t%d\t \t%d\t \n", j, i / naccounts + 1, 0, 0, 0);
		if (PQputline(con, sql))
		{
			fprintf(stderr, "PQputline failed\n");
			exit(1);
		}

		if (j % 100000 == 0)
			fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
					j, naccounts * scale,
					j * 100 / (naccounts * scale));
	}
	if (PQputline(con, "\\.\n"))
	{
		fprintf(stderr, "very last PQputline failed\n");
		exit(1);
	}
	if (PQendcopy(con))
	{
		fprintf(stderr, "PQendcopy failed\n");
		exit(1);
	}
	executeStatement(con, "commit");

	/* vacuum */
	if (!is_no_vacuum)
	{
		fprintf(stderr, "vacuum...\n");
		executeStatement(con, "vacuum analyze pgbench_branches");
		executeStatement(con, "vacuum analyze pgbench_tellers");
		executeStatement(con, "vacuum analyze pgbench_accounts");
		executeStatement(con, "vacuum analyze pgbench_history");
	}

	/*
	 * create indexes
	 */
	fprintf(stderr, "set primary keys...\n");
	for (i = 0; i < lengthof(DDLAFTERs); i++)
	{
		char		buffer[256];

		strncpy(buffer, DDLAFTERs[i], 256);

		if (index_tablespace != NULL)
		{
			char	   *escape_tablespace;

			escape_tablespace = PQescapeIdentifier(con, index_tablespace,
												   strlen(index_tablespace));
			snprintf(buffer + strlen(buffer), 256 - strlen(buffer),
					 " using index tablespace %s", escape_tablespace);
			PQfreemem(escape_tablespace);
		}

		executeStatement(con, buffer);
	}

	/*
	 * create foreign keys
	 */
	if (foreign_keys)
	{
		fprintf(stderr, "set foreign keys...\n");
		for (i = 0; i < lengthof(DDLKEYs); i++)
		{
			executeStatement(con, DDLKEYs[i]);
		}
	}


	fprintf(stderr, "done.\n");
	PQfinish(con);
}

/*
 * Parse the raw SQL and replace each :param with $n.
 */
static bool
parseQuery(Command *cmd, const char *raw_sql)
{
	char	   *sql,
			   *p;

	sql = xstrdup(raw_sql);
	cmd->argc = 1;
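
	/*
	 * Illustration only (a hypothetical input, not part of the patch):
	 * given the raw statement
	 *	SELECT abalance FROM pgbench_accounts WHERE aid = :aid
	 * the loop below rewrites the text to
	 *	SELECT abalance FROM pgbench_accounts WHERE aid = $1
	 * and leaves argv[0] = the rewritten SQL, argv[1] = "aid", argc = 2.
	 */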

	p = sql;
	while ((p = strchr(p, ':')) != NULL)
	{
		char		var[12];
		char	   *name;
		int			eaten;

		name = parseVariable(p, &eaten);
		if (name == NULL)
		{
			while (*p == ':')
			{
				p++;
			}
			continue;
		}

		if (cmd->argc >= MAX_ARGS)
		{
			fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n", MAX_ARGS - 1, raw_sql);
			return false;
		}

		sprintf(var, "$%d", cmd->argc);
		p = replaceVariable(&sql, p, eaten, var);

		cmd->argv[cmd->argc] = name;
		cmd->argc++;
	}

	cmd->argv[0] = sql;
	return true;
}

/* Parse a command; return a Command struct, or NULL if it's a comment */
static Command *
process_commands(char *buf)
{
	const char	delim[] = " \f\n\r\t\v";

	Command    *my_commands;
	int			j;
	char	   *p,
			   *tok;

	/* Make the string buf end at the next newline */
	if ((p = strchr(buf, '\n')) != NULL)
		*p = '\0';

	/* Skip leading whitespace */
	p = buf;
	while (isspace((unsigned char) *p))
		p++;

	/* If the line is empty or actually a comment, we're done */
	if (*p == '\0' || strncmp(p, "--", 2) == 0)
		return NULL;

	/* Allocate and initialize Command structure */
	my_commands = (Command *) xmalloc(sizeof(Command));
	my_commands->line = xstrdup(buf);
	my_commands->command_num = num_commands++;
	my_commands->type = 0;		/* until set */
	my_commands->argc = 0;

	if (*p == '\\')
	{
		my_commands->type = META_COMMAND;

		j = 0;
		tok = strtok(++p, delim);

		while (tok != NULL)
		{
			my_commands->argv[j++] = xstrdup(tok);
			my_commands->argc++;
			tok = strtok(NULL, delim);
		}

		if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
		{
			if (my_commands->argc < 4)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = 4; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			for (j = my_commands->argc < 5 ? 3 : 5; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "sleep") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}

			/*
			 * Split argument into number and unit to allow "sleep 1ms" etc.
			 * We don't have to terminate the number argument with null
			 * because it will be parsed with atoi, which ignores trailing
			 * non-digit characters.
			 */
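			/*
			 * Illustration only (a hypothetical script line, not part of the
			 * patch): "\sleep 10ms" is tokenized as argv[1] = "10ms".  The
			 * code below points argv[2] at the "ms" suffix inside that same
			 * string and raises argc to 3; atoi(argv[1]) still yields 10.
			 */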
			if (my_commands->argv[1][0] != ':')
			{
				char	   *c = my_commands->argv[1];

				while (isdigit((unsigned char) *c))
					c++;
				if (*c)
				{
					my_commands->argv[2] = c;
					if (my_commands->argc < 3)
						my_commands->argc = 3;
				}
			}

			if (my_commands->argc >= 3)
			{
				if (pg_strcasecmp(my_commands->argv[2], "us") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "ms") != 0 &&
					pg_strcasecmp(my_commands->argv[2], "s") != 0)
				{
					fprintf(stderr, "%s: unknown time unit '%s' - must be us, ms or s\n",
							my_commands->argv[0], my_commands->argv[2]);
					exit(1);
				}
			}

			for (j = 3; j < my_commands->argc; j++)
				fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
						my_commands->argv[0], my_commands->argv[j]);
		}
		else if (pg_strcasecmp(my_commands->argv[0], "setshell") == 0)
		{
			if (my_commands->argc < 3)
			{
				fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else if (pg_strcasecmp(my_commands->argv[0], "shell") == 0)
		{
			if (my_commands->argc < 2)
			{
				fprintf(stderr, "%s: missing command\n", my_commands->argv[0]);
				exit(1);
			}
		}
		else
		{
			fprintf(stderr, "Invalid command %s\n", my_commands->argv[0]);
			exit(1);
		}
	}
	else
	{
		my_commands->type = SQL_COMMAND;

		switch (querymode)
		{
			case QUERY_SIMPLE:
				my_commands->argv[0] = xstrdup(p);
				my_commands->argc++;
				break;
			case QUERY_EXTENDED:
			case QUERY_PREPARED:
				if (!parseQuery(my_commands, p))
					exit(1);
				break;
			default:
				exit(1);
		}
	}

	return my_commands;
}

static int
process_file(char *filename)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	FILE	   *fd;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	if (num_files >= MAX_FILES)
	{
		fprintf(stderr, "At most %d SQL files are allowed\n", MAX_FILES);
		exit(1);
	}

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	if (strcmp(filename, "-") == 0)
		fd = stdin;
	else if ((fd = fopen(filename, "r")) == NULL)
	{
		fprintf(stderr, "%s: %s\n", filename, strerror(errno));
		return false;
	}

	lineno = 0;

	while (fgets(buf, sizeof(buf), fd) != NULL)
	{
		Command    *command;

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}
	fclose(fd);

	my_commands[lineno] = NULL;

	sql_files[num_files++] = my_commands;

	return true;
}

static Command **
process_builtin(char *tb)
{
#define COMMANDS_ALLOC_NUM 128

	Command   **my_commands;
	int			lineno;
	char		buf[BUFSIZ];
	int			alloc_num;

	alloc_num = COMMANDS_ALLOC_NUM;
	my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);

	lineno = 0;

	for (;;)
	{
		char	   *p;
		Command    *command;

		p = buf;
		while (*tb && *tb != '\n')
			*p++ = *tb++;

		if (*tb == '\0')
			break;

		if (*tb == '\n')
			tb++;

		*p = '\0';

		command = process_commands(buf);
		if (command == NULL)
			continue;

		my_commands[lineno] = command;
		lineno++;

		if (lineno >= alloc_num)
		{
			alloc_num += COMMANDS_ALLOC_NUM;
			my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
		}
	}

	my_commands[lineno] = NULL;

	return my_commands;
}

/* print out results */
static void
printResults(int ttype, int normal_xacts, int nclients,
			 TState *threads, int nthreads,
			 instr_time total_time, instr_time conn_total_time)
{
	double		time_include,
				tps_include,
				tps_exclude;
	char	   *s;

	time_include = INSTR_TIME_GET_DOUBLE(total_time);
	tps_include = normal_xacts / time_include;
	tps_exclude = normal_xacts / (time_include -
						(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));

	if (ttype == 0)
		s = "TPC-B (sort of)";
	else if (ttype == 2)
		s = "Update only pgbench_accounts";
	else if (ttype == 1)
		s = "SELECT only";
	else
		s = "Custom query";

	printf("transaction type: %s\n", s);
	printf("scaling factor: %d\n", scale);
	printf("query mode: %s\n", QUERYMODE[querymode]);
	printf("number of clients: %d\n", nclients);
	printf("number of threads: %d\n", nthreads);
	if (duration <= 0)
	{
		printf("number of transactions per client: %d\n", nxacts);
		printf("number of transactions actually processed: %d/%d\n",
			   normal_xacts, nxacts * nclients);
	}
	else
	{
		printf("duration: %d s\n", duration);
		printf("number of transactions actually processed: %d\n",
			   normal_xacts);
	}
	printf("tps = %f (including connections establishing)\n", tps_include);
	printf("tps = %f (excluding connections establishing)\n", tps_exclude);

	/* Report per-command latencies */
	if (is_latencies)
	{
		int			i;

		for (i = 0; i < num_files; i++)
		{
			Command   **commands;

			if (num_files > 1)
				printf("statement latencies in milliseconds, file %d:\n", i + 1);
			else
				printf("statement latencies in milliseconds:\n");

			for (commands = sql_files[i]; *commands != NULL; commands++)
			{
				Command    *command = *commands;
				int			cnum = command->command_num;
				double		total_time;
				instr_time	total_exec_elapsed;
				int			total_exec_count;
				int			t;

				/* Accumulate per-thread data for command */
				INSTR_TIME_SET_ZERO(total_exec_elapsed);
				total_exec_count = 0;
				for (t = 0; t < nthreads; t++)
				{
					TState	   *thread = &threads[t];

					INSTR_TIME_ADD(total_exec_elapsed,
								   thread->exec_elapsed[cnum]);
					total_exec_count += thread->exec_count[cnum];
				}

				if (total_exec_count > 0)
					total_time = INSTR_TIME_GET_MILLISEC(total_exec_elapsed) / (double) total_exec_count;
				else
					total_time = 0.0;

				printf("\t%f\t%s\n", total_time, command->line);
			}
		}
	}
}


int
main(int argc, char **argv)
{
	int			c;
	int			nclients = 1;	/* default number of simulated clients */
	int			nthreads = 1;	/* default number of threads */
	int			is_init_mode = 0;		/* initialize mode? */
	int			is_no_vacuum = 0;		/* no vacuum at all before testing? */
	int			do_vacuum_accounts = 0; /* do vacuum accounts before testing? */
	int			ttype = 0;		/* transaction type. 0: TPC-B, 1: SELECT only,
								 * 2: skip update of branches and tellers */
	int			optindex;
	char	   *filename = NULL;
	bool		scale_given = false;

	CState	   *state;			/* status of clients */
	TState	   *threads;		/* array of threads */

	instr_time	start_time;		/* start up time */
	instr_time	total_time;
	instr_time	conn_total_time;
	int			total_xacts;

	int			i;

	static struct option long_options[] = {
		{"foreign-keys", no_argument, &foreign_keys, 1},
		{"index-tablespace", required_argument, NULL, 3},
		{"tablespace", required_argument, NULL, 2},
		{"unlogged-tables", no_argument, &unlogged_tables, 1},
		{NULL, 0, NULL, 0}
	};

#ifdef HAVE_GETRLIMIT
	struct rlimit rlim;
#endif

	PGconn	   *con;
	PGresult   *res;
	char	   *env;

	char		val[64];

	progname = get_progname(argv[0]);

	if (argc > 1)
	{
		if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
		{
			usage();
			exit(0);
		}
		if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
		{
			puts("pgbench (PostgreSQL) " PG_VERSION);
			exit(0);
		}
	}

#ifdef WIN32
	/* stderr is buffered on Win32. */
	setvbuf(stderr, NULL, _IONBF, 0);
#endif

	if ((env = getenv("PGHOST")) != NULL && *env != '\0')
		pghost = env;
	if ((env = getenv("PGPORT")) != NULL && *env != '\0')
		pgport = env;
	if ((env = getenv("PGUSER")) != NULL && *env != '\0')
		login = env;

	state = (CState *) xmalloc(sizeof(CState));
	memset(state, 0, sizeof(CState));

	while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
	{
		switch (c)
		{
			case 'i':
				is_init_mode++;
				break;
			case 'h':
				pghost = optarg;
				break;
			case 'n':
				is_no_vacuum++;
				break;
			case 'v':
				do_vacuum_accounts++;
				break;
			case 'p':
				pgport = optarg;
				break;
			case 'd':
				debug++;
				break;
			case 'S':
				ttype = 1;
				break;
			case 'N':
				ttype = 2;
				break;
			case 'c':
				nclients = atoi(optarg);
				if (nclients <= 0 || nclients > MAXCLIENTS)
				{
					fprintf(stderr, "invalid number of clients: %d\n", nclients);
					exit(1);
				}
#ifdef HAVE_GETRLIMIT
#ifdef RLIMIT_NOFILE			/* most platforms use RLIMIT_NOFILE */
				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
#else							/* but BSD doesn't ... */
				if (getrlimit(RLIMIT_OFILE, &rlim) == -1)
#endif   /* RLIMIT_NOFILE */
				{
					fprintf(stderr, "getrlimit failed: %s\n", strerror(errno));
					exit(1);
				}
				if (rlim.rlim_cur <= (nclients + 2))
				{
					fprintf(stderr, "You need at least %d open files but you are only allowed to use %ld.\n", nclients + 2, (long) rlim.rlim_cur);
					fprintf(stderr, "Use limit/ulimit to increase the limit before using pgbench.\n");
					exit(1);
				}
#endif   /* HAVE_GETRLIMIT */
				break;
			case 'j':			/* jobs */
				nthreads = atoi(optarg);
				if (nthreads <= 0)
				{
					fprintf(stderr, "invalid number of threads: %d\n", nthreads);
					exit(1);
				}
				break;
			case 'C':
				is_connect = true;
				break;
			case 'r':
				is_latencies = true;
				break;
			case 's':
				scale_given = true;
				scale = atoi(optarg);
				if (scale <= 0)
				{
					fprintf(stderr, "invalid scaling factor: %d\n", scale);
					exit(1);
				}
				break;
			case 't':
				if (duration > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				nxacts = atoi(optarg);
				if (nxacts <= 0)
				{
					fprintf(stderr, "invalid number of transactions: %d\n", nxacts);
					exit(1);
				}
				break;
			case 'T':
				if (nxacts > 0)
				{
					fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
					exit(1);
				}
				duration = atoi(optarg);
				if (duration <= 0)
				{
					fprintf(stderr, "invalid duration: %d\n", duration);
					exit(1);
				}
				break;
			case 'U':
				login = optarg;
				break;
			case 'l':
				use_log = true;
				break;
			case 'f':
				ttype = 3;
				filename = optarg;
				if (process_file(filename) == false || *sql_files[num_files - 1] == NULL)
					exit(1);
				break;
			case 'D':
				{
					char	   *p;

					if ((p = strchr(optarg, '=')) == NULL || p == optarg || *(p + 1) == '\0')
					{
						fprintf(stderr, "invalid variable definition: %s\n", optarg);
						exit(1);
					}

					*p++ = '\0';
					if (!putVariable(&state[0], "option", optarg, p))
						exit(1);
				}
				break;
			case 'F':
				fillfactor = atoi(optarg);
				if ((fillfactor < 10) || (fillfactor > 100))
				{
					fprintf(stderr, "invalid fillfactor: %d\n", fillfactor);
					exit(1);
				}
				break;
			case 'M':
				if (num_files > 0)
				{
					fprintf(stderr, "query mode (-M) should be specified before transaction scripts (-f)\n");
					exit(1);
				}
				for (querymode = 0; querymode < NUM_QUERYMODE; querymode++)
					if (strcmp(optarg, QUERYMODE[querymode]) == 0)
						break;
				if (querymode >= NUM_QUERYMODE)
				{
					fprintf(stderr, "invalid query mode (-M): %s\n", optarg);
					exit(1);
				}
				break;
			case 0:
				/* This covers long options which take no argument. */
				break;
			case 2:				/* tablespace */
				tablespace = optarg;
				break;
			case 3:				/* index-tablespace */
				index_tablespace = optarg;
				break;
			default:
				fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
				exit(1);
				break;
		}
	}

	if (argc > optind)
		dbName = argv[optind];
	else
	{
		if ((env = getenv("PGDATABASE")) != NULL && *env != '\0')
			dbName = env;
		else if (login != NULL && *login != '\0')
			dbName = login;
		else
			dbName = "";
	}

	if (is_init_mode)
	{
		init(is_no_vacuum);
		exit(0);
	}

	/* Use DEFAULT_NXACTS if neither nxacts nor duration is specified. */
	if (nxacts <= 0 && duration <= 0)
		nxacts = DEFAULT_NXACTS;

	if (nclients % nthreads != 0)
	{
		fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n", nclients, nthreads);
		exit(1);
	}

	/*
	 * is_latencies only works with multiple threads in thread-based
	 * implementations, not fork-based ones, because it supposes that the
	 * parent can see changes made to the per-thread execution stats by child
	 * threads.  It seems useful enough to accept despite this limitation, but
	 * perhaps we should FIXME someday (by passing the stats data back up
	 * through the parent-to-child pipes).
	 */
#ifndef ENABLE_THREAD_SAFETY
	if (is_latencies && nthreads > 1)
	{
		fprintf(stderr, "-r does not work with -j larger than 1 on this platform.\n");
		exit(1);
	}
#endif

	/*
	 * save main process id in the global variable because process id will be
	 * changed after fork.
	 */
	main_pid = (int) getpid();

	if (nclients > 1)
	{
		state = (CState *) xrealloc(state, sizeof(CState) * nclients);
		memset(state + 1, 0, sizeof(CState) * (nclients - 1));

		/* copy any -D switch values to all clients */
		for (i = 1; i < nclients; i++)
		{
			int			j;

			state[i].id = i;
			for (j = 0; j < state[0].nvariables; j++)
			{
				if (!putVariable(&state[i], "startup", state[0].variables[j].name, state[0].variables[j].value))
					exit(1);
			}
		}
	}

	if (debug)
	{
		if (duration <= 0)
			printf("pghost: %s pgport: %s nclients: %d nxacts: %d dbName: %s\n",
				   pghost, pgport, nclients, nxacts, dbName);
		else
			printf("pghost: %s pgport: %s nclients: %d duration: %d dbName: %s\n",
				   pghost, pgport, nclients, duration, dbName);
	}

	/* opening connection... */
	con = doConnect();
	if (con == NULL)
		exit(1);

	if (PQstatus(con) == CONNECTION_BAD)
	{
		fprintf(stderr, "Connection to database '%s' failed.\n", dbName);
		fprintf(stderr, "%s", PQerrorMessage(con));
		exit(1);
	}

	if (ttype != 3)
	{
		/*
		 * get the scaling factor that should be same as count(*) from
		 * pgbench_branches if this is not a custom query
		 */
		res = PQexec(con, "select count(*) from pgbench_branches");
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(con));
			exit(1);
		}
		scale = atoi(PQgetvalue(res, 0, 0));
		if (scale < 0)
		{
			fprintf(stderr, "count(*) from pgbench_branches invalid (%d)\n", scale);
			exit(1);
		}
		PQclear(res);

		/* warn if we override user-given -s switch */
		if (scale_given)
			fprintf(stderr,
			"Scale option ignored, using pgbench_branches table count = %d\n",
					scale);
	}

	/*
	 * :scale variables normally get -s or database scale, but don't override
	 * an explicit -D switch
	 */
	if (getVariable(&state[0], "scale") == NULL)
	{
		snprintf(val, sizeof(val), "%d", scale);
		for (i = 0; i < nclients; i++)
		{
			if (!putVariable(&state[i], "startup", "scale", val))
				exit(1);
		}
	}

	if (!is_no_vacuum)
	{
		fprintf(stderr, "starting vacuum...");
		executeStatement(con, "vacuum pgbench_branches");
		executeStatement(con, "vacuum pgbench_tellers");
		executeStatement(con, "truncate pgbench_history");
		fprintf(stderr, "end.\n");

		if (do_vacuum_accounts)
		{
			fprintf(stderr, "starting vacuum pgbench_accounts...");
			executeStatement(con, "vacuum analyze pgbench_accounts");
			fprintf(stderr, "end.\n");
		}
	}
	PQfinish(con);

	/* set random seed */
	INSTR_TIME_SET_CURRENT(start_time);
	srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));

	/* process builtin SQL scripts */
	switch (ttype)
	{
		case 0:
			sql_files[0] = process_builtin(tpc_b);
			num_files = 1;
			break;

		case 1:
			sql_files[0] = process_builtin(select_only);
			num_files = 1;
			break;

		case 2:
			sql_files[0] = process_builtin(simple_update);
			num_files = 1;
			break;

		default:
			break;
	}

	/* set up thread data structures */
	threads = (TState *) xmalloc(sizeof(TState) * nthreads);
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		thread->tid = i;
		thread->state = &state[nclients / nthreads * i];
		thread->nstate = nclients / nthreads;
		thread->random_state[0] = random();
		thread->random_state[1] = random();
		thread->random_state[2] = random();

		if (is_latencies)
		{
			/* Reserve memory for the thread to store per-command latencies */
			int			t;

			thread->exec_elapsed = (instr_time *)
				xmalloc(sizeof(instr_time) * num_commands);
			thread->exec_count = (int *)
				xmalloc(sizeof(int) * num_commands);

			for (t = 0; t < num_commands; t++)
			{
				INSTR_TIME_SET_ZERO(thread->exec_elapsed[t]);
				thread->exec_count[t] = 0;
			}
		}
		else
		{
			thread->exec_elapsed = NULL;
			thread->exec_count = NULL;
		}
	}

	/* get start up time */
	INSTR_TIME_SET_CURRENT(start_time);

	/* set alarm if duration is specified. */
	if (duration > 0)
		setalarm(duration);

	/* start threads */
	for (i = 0; i < nthreads; i++)
	{
		TState	   *thread = &threads[i];

		INSTR_TIME_SET_CURRENT(thread->start_time);

		/* the first thread (i = 0) is executed by main thread */
		if (i > 0)
		{
			int			err = pthread_create(&thread->thread, NULL, threadRun, thread);

			if (err != 0 || thread->thread == INVALID_THREAD)
			{
				fprintf(stderr, "cannot create thread: %s\n", strerror(err));
				exit(1);
			}
		}
		else
		{
			thread->thread = INVALID_THREAD;
		}
	}

	/* wait for threads and accumulate results */
	total_xacts = 0;
	INSTR_TIME_SET_ZERO(conn_total_time);
	for (i = 0; i < nthreads; i++)
	{
		void	   *ret = NULL;

		if (threads[i].thread == INVALID_THREAD)
			ret = threadRun(&threads[i]);
		else
			pthread_join(threads[i].thread, &ret);

		if (ret != NULL)
		{
			TResult    *r = (TResult *) ret;

			total_xacts += r->xacts;
			INSTR_TIME_ADD(conn_total_time, r->conn_time);
			free(ret);
		}
	}
	disconnect_all(state, nclients);

	/* get end time */
	INSTR_TIME_SET_CURRENT(total_time);
	INSTR_TIME_SUBTRACT(total_time, start_time);
	printResults(ttype, total_xacts, nclients, threads, nthreads,
				 total_time, conn_total_time);

	return 0;
}

static void *
threadRun(void *arg)
{
	TState	   *thread = (TState *) arg;
	CState	   *state = thread->state;
	TResult    *result;
	FILE	   *logfile = NULL; /* per-thread log file */
	instr_time	start,
				end;
	int			nstate = thread->nstate;
	int			remains = nstate;		/* number of remaining clients */
	int			i;

	result = xmalloc(sizeof(TResult));
	INSTR_TIME_SET_ZERO(result->conn_time);

	/* open log file if requested */
	if (use_log)
	{
		char		logpath[64];

		if (thread->tid == 0)
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d", main_pid);
		else
			snprintf(logpath, sizeof(logpath), "pgbench_log.%d.%d", main_pid, thread->tid);
		logfile = fopen(logpath, "w");

		if (logfile == NULL)
		{
			fprintf(stderr, "Couldn't open logfile \"%s\": %s", logpath, strerror(errno));
			goto done;
		}
	}

	if (!is_connect)
	{
		/* make connections to the database */
		for (i = 0; i < nstate; i++)
		{
			if ((state[i].con = doConnect()) == NULL)
				goto done;
		}
	}

	/* time after thread and connections set up */
	INSTR_TIME_SET_CURRENT(result->conn_time);
	INSTR_TIME_SUBTRACT(result->conn_time, thread->start_time);

	/* send start up queries in async manner */
	for (i = 0; i < nstate; i++)
	{
		CState	   *st = &state[i];
		Command   **commands = sql_files[st->use_file];
		int			prev_ecnt = st->ecnt;

		st->use_file = getrand(thread, 0, num_files - 1);
		if (!doCustom(thread, st, &result->conn_time, logfile))
			remains--;			/* I've aborted */

		if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
		{
			fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
			remains--;			/* I've aborted */
			PQfinish(st->con);
			st->con = NULL;
		}
	}

	while (remains > 0)
	{
		fd_set		input_mask;
		int			maxsock;	/* max socket number to be waited */
		int64		now_usec = 0;
		int64		min_usec;

		FD_ZERO(&input_mask);

		maxsock = -1;
		min_usec = INT64_MAX;
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			sock;

			if (st->sleeping)
			{
				int			this_usec;

				if (min_usec == INT64_MAX)
				{
					instr_time	now;

					INSTR_TIME_SET_CURRENT(now);
					now_usec = INSTR_TIME_GET_MICROSEC(now);
				}

				this_usec = st->until - now_usec;
				if (min_usec > this_usec)
					min_usec = this_usec;
			}
			else if (st->con == NULL)
			{
				continue;
			}
			else if (commands[st->state]->type == META_COMMAND)
			{
				min_usec = 0;	/* the connection is ready to run */
				break;
			}

			sock = PQsocket(st->con);
			if (sock < 0)
			{
				fprintf(stderr, "bad socket: %s\n", strerror(errno));
				goto done;
			}

			FD_SET(sock, &input_mask);

			if (maxsock < sock)
				maxsock = sock;
		}

		if (min_usec > 0 && maxsock != -1)
		{
			int			nsocks; /* return from select(2) */

			if (min_usec != INT64_MAX)
			{
				struct timeval timeout;

				timeout.tv_sec = min_usec / 1000000;
				timeout.tv_usec = min_usec % 1000000;
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout);
			}
			else
				nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
			if (nsocks < 0)
			{
				if (errno == EINTR)
					continue;
				/* must be something wrong */
				fprintf(stderr, "select failed: %s\n", strerror(errno));
				goto done;
			}
		}

		/* ok, backend returns reply */
		for (i = 0; i < nstate; i++)
		{
			CState	   *st = &state[i];
			Command   **commands = sql_files[st->use_file];
			int			prev_ecnt = st->ecnt;

			if (st->con && (FD_ISSET(PQsocket(st->con), &input_mask)
							|| commands[st->state]->type == META_COMMAND))
			{
				if (!doCustom(thread, st, &result->conn_time, logfile))
					remains--;	/* I've aborted */
			}

			if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
			{
				fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
				remains--;		/* I've aborted */
				PQfinish(st->con);
				st->con = NULL;
			}
		}
	}

done:
	INSTR_TIME_SET_CURRENT(start);
	disconnect_all(state, nstate);
	result->xacts = 0;
	for (i = 0; i < nstate; i++)
		result->xacts += state[i].cnt;
	INSTR_TIME_SET_CURRENT(end);
	INSTR_TIME_ACCUM_DIFF(result->conn_time, end, start);
	if (logfile)
		fclose(logfile);
	return result;
}


/*
 * Support for duration option: set timer_exceeded after so many seconds.
 */

#ifndef WIN32

static void
handle_sig_alarm(SIGNAL_ARGS)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	pqsignal(SIGALRM, handle_sig_alarm);
	alarm(seconds);
}

#ifndef ENABLE_THREAD_SAFETY

/*
 * implements pthread using fork.
 */

typedef struct fork_pthread
{
	pid_t		pid;
	int			pipes[2];
}	fork_pthread;

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	fork_pthread *th;
	void	   *ret;

	th = (fork_pthread *) xmalloc(sizeof(fork_pthread));
	if (pipe(th->pipes) < 0)
	{
		free(th);
		return errno;
	}

	th->pid = fork();
	if (th->pid == -1)			/* error */
	{
		free(th);
		return errno;
	}
	if (th->pid != 0)			/* in parent process */
	{
		close(th->pipes[1]);
		*thread = th;
		return 0;
	}

	/* in child process */
	close(th->pipes[0]);

	/* set alarm again because the child does not inherit timers */
	if (duration > 0)
		setalarm(duration);

	ret = start_routine(arg);
	write(th->pipes[1], ret, sizeof(TResult));
	close(th->pipes[1]);
	free(th);
	exit(0);
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	int			status;

	while (waitpid(th->pid, &status, 0) != th->pid)
	{
		if (errno != EINTR)
			return errno;
	}

	if (thread_return != NULL)
	{
		/* assume result is TResult */
		*thread_return = xmalloc(sizeof(TResult));
		if (read(th->pipes[0], *thread_return, sizeof(TResult)) != sizeof(TResult))
		{
			free(*thread_return);
			*thread_return = NULL;
		}
	}
	close(th->pipes[0]);

	free(th);
	return 0;
}
#endif
#else							/* WIN32 */

static VOID CALLBACK
win32_timer_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
{
	timer_exceeded = true;
}

static void
setalarm(int seconds)
{
	HANDLE		queue;
	HANDLE		timer;

	/* This function will be called at most once, so we can cheat a bit. */
	queue = CreateTimerQueue();
	if (seconds > ((DWORD) -1) / 1000 ||
		!CreateTimerQueueTimer(&timer, queue,
							   win32_timer_callback, NULL, seconds * 1000, 0,
							   WT_EXECUTEINTIMERTHREAD | WT_EXECUTEONLYONCE))
	{
		fprintf(stderr, "Failed to set timer\n");
		exit(1);
	}
}

/* partial pthread implementation for Windows */

typedef struct win32_pthread
{
	HANDLE		handle;
	void	   *(*routine) (void *);
	void	   *arg;
	void	   *result;
} win32_pthread;

static unsigned __stdcall
win32_pthread_run(void *arg)
{
	win32_pthread *th = (win32_pthread *) arg;

	th->result = th->routine(th->arg);

	return 0;
}

static int
pthread_create(pthread_t *thread,
			   pthread_attr_t *attr,
			   void *(*start_routine) (void *),
			   void *arg)
{
	int			save_errno;
	win32_pthread *th;

	th = (win32_pthread *) xmalloc(sizeof(win32_pthread));
	th->routine = start_routine;
	th->arg = arg;
	th->result = NULL;

	th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
	if (th->handle == NULL)
	{
		save_errno = errno;
		free(th);
		return save_errno;
	}

	*thread = th;
	return 0;
}

static int
pthread_join(pthread_t th, void **thread_return)
{
	if (th == NULL || th->handle == NULL)
		return errno = EINVAL;

	if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
	{
		_dosmaperr(GetLastError());
		return errno;
	}

	if (thread_return)
		*thread_return = th->result;

	CloseHandle(th->handle);
	free(th);
	return 0;
}

#endif   /* WIN32 */

#12

Heikki Linnakangas

hlinnakangas@vmware.com

over 13 years ago

In reply to: Amit Kapila (#11)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On 03.10.2012 19:03, Amit Kapila wrote:

Any comments/suggestions regarding performance/functionality test?

Hmm. Doing a lot of UPDATEs concurrently can be limited by the
WALInsertLock, which each inserter holds while copying the WAL record to
the buffer. Reducing the size of the WAL records, by compression or
delta encoding, alleviates that bottleneck: when WAL records are
smaller, the lock needs to be held for a shorter duration. That improves
throughput, even if individual backends need to do more CPU work to
compress the records, because that work can be done in parallel. I
suspect much of the benefit you're seeing in these tests might be
because of that effect.
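
As a rough illustration of the serialization argument above: if every insert copies its record while holding one lock, the per-record hold time caps total throughput no matter how many backends are running. The sketch below is only a back-of-the-envelope model. The function name and the cost constants are made up for illustration and are not measurements from PostgreSQL.

```c
#include <assert.h>

/*
 * Hypothetical model, not PostgreSQL code: each insert holds one global
 * lock for fixed_ns + record_bytes * copy_ns_per_byte nanoseconds.
 * Because lock holders are serialized, 1e9 / hold_ns is an upper bound
 * on the number of inserts per second across all backends.
 */
static double
max_inserts_per_sec(double record_bytes, double copy_ns_per_byte,
                    double fixed_ns)
{
    double hold_ns = fixed_ns + record_bytes * copy_ns_per_byte;

    assert(hold_ns > 0.0);
    return 1e9 / hold_ns;
}
```

With an assumed 1 ns per byte and 100 ns fixed overhead, a 400-byte record gives a cap of 2,000,000 inserts per second and a 100-byte record gives 5,000,000. The compression cost does not appear in the model because it is paid outside the lock, in parallel.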

As it happens, I've been working on making WAL insertion scale better in
general:
http://archives.postgresql.org/message-id/5064779A.3050407@vmware.com.
That should also help most when inserting large WAL records. The
question is: assuming we commit the xloginsert-scale patch, how much
benefit is there left from the compression? It will surely still help to
reduce the size of WAL, which can certainly help if you're limited by
the WAL I/O, but I suspect the results from the pgbench tests you run
might look quite different.

So, could you rerun these tests with the xloginsert-scale patch applied?
Reducing the WAL size might still be a good idea even if the patch
doesn't have much effect on TPS, but I'd like to make sure that the
compression doesn't hurt performance. Also, it would be a good idea to
repeat the tests with just a single client; we don't want to hurt the
performance in that scenario either.

- Heikki

#13

Amit Kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Heikki Linnakangas (#12)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Thursday, October 04, 2012 12:54 PM Heikki Linnakangas wrote:
On 03.10.2012 19:03, Amit Kapila wrote:

Any comments/suggestions regarding performance/functionality test?

Hmm. Doing a lot of UPDATEs concurrently can be limited by the
WALInsertLock, which each inserter holds while copying the WAL record to
the buffer. Reducing the size of the WAL records, by compression or
delta encoding, alleviates that bottleneck: when WAL records are
smaller, the lock needs to be held for a shorter duration. That improves
throughput, even if individual backends need to do more CPU work to
compress the records, because that work can be done in parallel. I
suspect much of the benefit you're seeing in these tests might be
because of that effect.

As it happens, I've been working on making WAL insertion scale better in
general:
http://archives.postgresql.org/message-id/5064779A.3050407@vmware.com.
That should also help most when inserting large WAL records. The
question is: assuming we commit the xloginsert-scale patch, how much
benefit is there left from the compression? It will surely still help to
reduce the size of WAL, which can certainly help if you're limited by
the WAL I/O, but I suspect the results from the pgbench tests you run
might look quite different.

So, could you rerun these tests with the xloginsert-scale patch applied?

I shall run the performance tests with the xloginsert-scale patch applied
as well, for both the single-threaded and multi-threaded cases.

With Regards,
Amit Kapila.

#14

Amit kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Amit Kapila (#11)

2 attachment(s)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:

On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:

On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
On 25.09.2012 18:27, Amit Kapila wrote:

If you feel the comparison is a must, we can do it in the same way as we identify for HOT?

Now I shall run the following tests and post the results here:
a. The attached patch in the mode where it takes advantage of the history
tuple.
b. The same patch with the modified-column calculation changed to use
memcmp().

1. Please find the results (pgbench_test.htm) for point 2, where one column
gets a fixed update (only the last few bytes are random) and a second column
is updated with a 32-byte random string. The runs for record sizes 50 and 100
are still in progress; the others are attached to this mail.

The results for updated record sizes 50 and 100 are attached to this mail.

Observations
a. The performance is comparable for both approaches

4. Complete testing for LZ compression patch using testcases defined for
original patch

a. During testing of the LZ patch, a few issues were found when the updated record contains NULLs. I am working on a fix.

The problems were:
a. The offset calculation during compression is based on the input buffer [new tuple] and the old tuple [history]. The offset should be relative to the history end.
In normal LZ compression the input buffer and the history are always adjacent, so the problem does not arise there.
b. The new tuple contents should not be added to the history buffer, because the new tuple and the history live at different addresses. Adding them makes the offset
calculation go wrong.
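
To make the two points above concrete, here is a toy delta encoder and decoder. It is only a sketch under stated assumptions: the tag format, the names (toy_delta_encode, toy_delta_decode, TOY_*) and the brute-force search are all hypothetical and are not the pglz format or the code in the attached patch. It shows matches being searched only in the history, and the offset being measured back from the history end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy format: {TOY_LITERAL, byte} or {TOY_COPY, off, len}. */
#define TOY_LITERAL   0
#define TOY_COPY      1
#define TOY_MIN_MATCH ((size_t) 3)
#define TOY_MAX       ((size_t) 255)

/* Encode "in" against "hist"; returns the encoded length, or -1 if out of space. */
static int
toy_delta_encode(const unsigned char *hist, size_t hlen,
                 const unsigned char *in, size_t ilen,
                 unsigned char *out, size_t outcap)
{
    size_t ip = 0, op = 0;

    while (ip < ilen)
    {
        size_t best_len = 0, best_pos = 0, hp, off;

        /* Search only the history; the new tuple is never added to it. */
        for (hp = 0; hp < hlen; hp++)
        {
            size_t len = 0;

            while (hp + len < hlen && ip + len < ilen && len < TOY_MAX &&
                   hist[hp + len] == in[ip + len])
                len++;
            if (len > best_len)
            {
                best_len = len;
                best_pos = hp;
            }
        }

        /* Offset is relative to the history end, not to the input position. */
        off = hlen - best_pos;
        if (best_len >= TOY_MIN_MATCH && off <= TOY_MAX)
        {
            if (op + 3 > outcap)
                return -1;
            assert(best_len <= off);    /* the match lies inside the history */
            out[op++] = TOY_COPY;
            out[op++] = (unsigned char) off;
            out[op++] = (unsigned char) best_len;
            ip += best_len;
        }
        else
        {
            if (op + 2 > outcap)
                return -1;
            out[op++] = TOY_LITERAL;
            out[op++] = in[ip++];
        }
    }
    return (int) op;
}

/* Decode using the same history; returns the decoded length, or -1 on bad input. */
static int
toy_delta_decode(const unsigned char *hist, size_t hlen,
                 const unsigned char *enc, size_t elen,
                 unsigned char *out, size_t outcap)
{
    const unsigned char *hend = hist + hlen;
    size_t ep = 0, op = 0;

    while (ep < elen)
    {
        if (enc[ep] == TOY_COPY)
        {
            size_t off, len;

            if (ep + 3 > elen)
                return -1;
            off = enc[ep + 1];
            len = enc[ep + 2];
            if (off > hlen || len > off || op + len > outcap)
                return -1;
            memcpy(out + op, hend - off, len);  /* copy from the history only */
            op += len;
            ep += 3;
        }
        else if (enc[ep] == TOY_LITERAL)
        {
            if (ep + 2 > elen || op + 1 > outcap)
                return -1;
            out[op++] = enc[ep + 1];
            ep += 2;
        }
        else
            return -1;
    }
    return (int) op;
}
```

Because the decoder only ever reads from the history, it does not matter where the new tuple sits in memory relative to the old one, which is the property the fix is after.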

A patch fixing the above problems is attached to this mail.

With Regards,
Amit Kapila.

Attachments:

pgbench_test_50_and_100.htmtext/html; name=pgbench_test_50_and_100.htmDownload

pglz_wal_update_v1.patchapplication/octet-stream; name=pglz_wal_update_v1.patchDownload

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 70,75 ****
--- 70,76 ----
  #include "utils/snapmgr.h"
  #include "utils/syscache.h"
  #include "utils/tqual.h"
+ #include "utils/pg_lzcompress.h"
  
  
  /* GUC variable */
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 86,92 ----
  					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ 				HeapTuple oldtup,
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
***************
*** 3195,3204 **** l2:
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 											 newbuf, heaptup,
! 											 all_visible_cleared,
! 											 all_visible_cleared_new);
  
  		if (newbuf != buffer)
  		{
--- 3197,3208 ----
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr;
! 
! 		recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 								 newbuf, heaptup, &oldtup,
! 								 all_visible_cleared,
! 								 all_visible_cleared_new);
  
  		if (newbuf != buffer)
  		{
***************
*** 4428,4434 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
   */
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! 				Buffer newbuf, HeapTuple newtup,
  				bool all_visible_cleared, bool new_all_visible_cleared)
  {
  	xl_heap_update xlrec;
--- 4432,4438 ----
   */
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! 				Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
  				bool all_visible_cleared, bool new_all_visible_cleared)
  {
  	xl_heap_update xlrec;
***************
*** 4437,4442 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4441,4456 ----
  	XLogRecPtr	recptr;
  	XLogRecData rdata[4];
  	Page		page = BufferGetPage(newbuf);
+ 	union
+ 	{
+ 		PGLZ_Header pglzheader;
+ 		char buf[BLCKSZ];
+ 	} buf;
+ 	char	   *newtupdata;
+ 	int			newtuplen;
+ 	char	   *oldtupdata;
+ 	int			oldtuplen;
+ 	bool		compressed = false;
  
  	/* Caller should not call me on a non-WAL-logged relation */
  	Assert(RelationNeedsWAL(reln));
***************
*** 4446,4456 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  	else
  		info = XLOG_HEAP_UPDATE;
  
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	xlrec.all_visible_cleared = all_visible_cleared;
  	xlrec.newtid = newtup->t_self;
! 	xlrec.new_all_visible_cleared = new_all_visible_cleared;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
--- 4460,4503 ----
  	else
  		info = XLOG_HEAP_UPDATE;
  
+ 	newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ 	newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ 	oldtupdata = ((char *) oldtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ 	oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ 
+ 	/* Is the update going to the same page? */
+ 	if (oldbuf == newbuf)
+ 	{
+ 		/*
+ 		 * enable this if you only want to compress the new tuple as is,
+ 		 * without taking advantage of the old tuple.
+ 		 */
+ #ifdef COMPRESS_ONLY
+ 		oldtuplen = 0;
+ #endif
+ 
+ 		/* Delta-encode the new tuple using the old tuple */
+ 		/* XXX: assert that the output buffer is large enough (PGLZ_MAX_OUTPUT) */
+ 		if (pglz_compress_with_history(newtupdata, newtuplen,
+ 									   oldtupdata, oldtuplen,
+ 									   (PGLZ_Header *) &buf.pglzheader, NULL))
+ 		{
+ 			compressed = true;
+ 			newtupdata = (char *) &buf.pglzheader;
+ 			newtuplen = VARSIZE(&buf.pglzheader);
+ 		}
+ 	}
+ 
+ 	xlrec.flags = 0;
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	if (all_visible_cleared)
! 		xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
  	xlrec.newtid = newtup->t_self;
! 	if (new_all_visible_cleared)
! 		xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! 	if (compressed)
! 		xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
***************
*** 4478,4489 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  	rdata[2].next = &(rdata[3]);
  
  	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! 	rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! 	rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
  	rdata[3].buffer = newbuf;
  	rdata[3].buffer_std = true;
  	rdata[3].next = NULL;
  
  	/* If new tuple is the single and first tuple on page... */
  	if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
  		PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
--- 4525,4537 ----
  	rdata[2].next = &(rdata[3]);
  
  	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! 	rdata[3].data = newtupdata;
! 	rdata[3].len = newtuplen;
  	rdata[3].buffer = newbuf;
  	rdata[3].buffer_std = true;
  	rdata[3].next = NULL;
  
+ 
  	/* If new tuple is the single and first tuple on page... */
  	if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
  		PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
***************
*** 5232,5237 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5280,5287 ----
  	OffsetNumber offnum;
  	ItemId		lp = NULL;
  	HeapTupleHeader htup;
+ 	HeapTupleHeader oldtup = NULL;
+ 	uint32		old_tup_len = 0;
  	struct
  	{
  		HeapTupleHeaderData hdr;
***************
*** 5246,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5296,5302 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5289,5295 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
  		elog(PANIC, "heap_update_redo: invalid lp");
  
! 	htup = (HeapTupleHeader) PageGetItem(page, lp);
  
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
--- 5339,5346 ----
  	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
  		elog(PANIC, "heap_update_redo: invalid lp");
  
! 	oldtup = htup = (HeapTupleHeader) PageGetItem(page, lp);
! 	old_tup_len = ItemIdGetLength(lp);
  
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
***************
*** 5308,5314 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->all_visible_cleared)
  		PageClearAllVisible(page);
  
  	/*
--- 5359,5365 ----
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
  		PageClearAllVisible(page);
  
  	/*
***************
*** 5330,5336 **** newt:;
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->new_all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5381,5387 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5380,5395 **** newsame:;
  	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
  
  	newlen = record->xl_len - hsize;
  	Assert(newlen <= MaxHeapTupleSize);
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
  	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 	/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 	memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 		   (char *) xlrec + hsize,
! 		   newlen);
  	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
--- 5431,5470 ----
  	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
  
  	newlen = record->xl_len - hsize;
+ 
  	Assert(newlen <= MaxHeapTupleSize);
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
  	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 
! 	/*
! 	 * If the new tuple was delta-encoded, decode it.
! 	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! 	{
! 		PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
! 
! 		/*
! 		 * FIXME: this won't work on architectures with strict alignment,
! 		 * because encoded_data might not be aligned and pglz_decompress
! 		 * assumes that the PGLZ_Header is correctly aligned. XXX: also add
! 		 * some sanity checks with PGLZ_RAW_SIZE here.
! 		 */
! 		pglz_decompress_with_history(encoded_data,
! 									 ((char *) htup) + offsetof(HeapTupleHeaderData, t_bits),
! 									 ((char *) oldtup) + offsetof(HeapTupleHeaderData, t_bits),
! 									 old_tup_len - offsetof(HeapTupleHeaderData, t_bits));
! 		newlen = encoded_data->rawsize;
! 	}
! 	else
! 	{
! 		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 			   (char *) xlrec + hsize,
! 			   newlen);
! 	}
  	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
***************
*** 5404,5410 **** newsame:;
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
! 	if (xlrec->new_all_visible_cleared)
  		PageClearAllVisible(page);
  
  	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
--- 5479,5485 ----
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
! 	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
  		PageClearAllVisible(page);
  
  	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 373,379 **** do { \
   */
  static inline int
  pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
! 				int *lenp, int *offp, int good_match, int good_drop)
  {
  	PGLZ_HistEntry *hent;
  	int32		len = 0;
--- 373,380 ----
   */
  static inline int
  pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
! 				const char *hend, int *lenp, int *offp, int good_match,
! 				int good_drop)
  {
  	PGLZ_HistEntry *hent;
  	int32		len = 0;
***************
*** 391,399 **** pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
  		int32		thislen;
  
  		/*
  		 * Stop if the offset does not fit into our tag anymore.
  		 */
- 		thisoff = ip - hp;
  		if (thisoff >= 0x0fff)
  			break;
  
--- 392,408 ----
  		int32		thislen;
  
  		/*
+ 		 * If a history buffer is present, calculate the offset from the
+ 		 * end of the history instead of from the input position.
+ 		 */
+ 		if (NULL == hend)
+ 			thisoff = ip - hp;
+ 		else
+ 			thisoff = hend - hp;
+ 
+ 		/*
  		 * Stop if the offset does not fit into our tag anymore.
  		 */
  		if (thisoff >= 0x0fff)
  			break;
  
***************
*** 482,487 **** bool
--- 491,510 ----
  pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			  const PGLZ_Strategy *strategy)
  {
+ 	return pglz_compress_with_history(source, slen, NULL, 0, dest, strategy);
+ }
+ 
+ /*
+  * Like pglz_compress, but uses another piece of data to initialize the
+  * history table. When decompressing, you must pass the same history data
+  * to pglz_decompress_with_history(). This makes it possible to do simple
+  * delta compression.
+  */
+ bool
+ pglz_compress_with_history(const char *source, int32 slen,
+ 						   const char *history, int32 hlen,
+ 						   PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+ {
  	unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
  	unsigned char *bstart = bp;
  	int			hist_next = 0;
***************
*** 500,505 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 523,530 ----
  	int32		result_size;
  	int32		result_max;
  	int32		need_rate;
+ 	const char 	*hp = NULL;
+ 	const char 	*hend = NULL;
  
  	/*
  	 * Our fallback strategy is the default.
***************
*** 560,565 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 585,608 ----
  	 * hist_entries[] array; its entries are initialized as they are used.
  	 */
  	memset(hist_start, 0, sizeof(hist_start));
+ 	if (hlen > 0)
+ 	{
+ 		hp = history;
+ 		hend = history + hlen;
+ 		while (hp < hend)
+ 		{
+ 			/*
+ 			 * XXX: I think this doesn't handle the last few bytes of the
+ 			 * history correctly, or at least not in the most efficient way.
+ 			 * Logically, we should behave like the history and the source
+ 			 * strings are concatenated, but we use 'hend' here.
+ 			 */
+ 			pglz_hist_add(hist_start, hist_entries,
+ 						  hist_next, hist_recycle,
+ 						  hp, hend);
+ 			hp++;			/* Do not do this ++ in the line above! */
+ 		}
+ 	}
  
  	/*
  	 * Compress the source directly into the output buffer.
***************
*** 588,594 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  		/*
  		 * Try to find a match in the history
  		 */
! 		if (pglz_find_match(hist_start, dp, dend, &match_len,
  							&match_off, good_match, good_drop))
  		{
  			/*
--- 631,637 ----
  		/*
  		 * Try to find a match in the history
  		 */
! 		if (pglz_find_match(hist_start, dp, dend, hend, &match_len,
  							&match_off, good_match, good_drop))
  		{
  			/*
***************
*** 596,609 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			 * characters.
  			 */
  			pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
! 			while (match_len--)
  			{
! 				pglz_hist_add(hist_start, hist_entries,
! 							  hist_next, hist_recycle,
! 							  dp, dend);
! 				dp++;			/* Do not do this ++ in the line above! */
! 				/* The macro would do it four times - Jan.	*/
  			}
  			found_match = true;
  		}
  		else
--- 639,670 ----
  			 * characters.
  			 */
  			pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
! 
! 			/*
! 			 * If the history is passed as a separate buffer, don't add source
! 			 * data to the history. This is required because match offsets are
! 			 * calculated relative to the history buffer.
! 			 */
! 			if (NULL == hend)
  			{
! 				while (match_len--)
! 				{
! 					pglz_hist_add(hist_start, hist_entries,
! 								  hist_next, hist_recycle,
! 								  dp, dend);
! 					dp++;			/* Do not do this ++ in the line above! */
! 					/* The macro would do it four times - Jan.	*/
! 				}
  			}
+ 			else
+ 			{
+ 				/*
+ 				 * Advance the source pointer by the match length directly,
+ 				 * because source data is not added to the history.
+ 				 */
+ 				dp += match_len;
+ 			}
+ 
  			found_match = true;
  		}
  		else
***************
*** 612,620 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			 * No match found. Copy one literal byte.
  			 */
  			pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
! 			pglz_hist_add(hist_start, hist_entries,
! 						  hist_next, hist_recycle,
! 						  dp, dend);
  			dp++;				/* Do not do this ++ in the line above! */
  			/* The macro would do it four times - Jan.	*/
  		}
--- 673,687 ----
  			 * No match found. Copy one literal byte.
  			 */
  			pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
! 
! 			/*
! 			 * If the history is passed as a separate buffer, don't add any
! 			 * unmatched input data to the history.
! 			 */
! 			if (NULL == hend)
! 				pglz_hist_add(hist_start, hist_entries,
! 							  hist_next, hist_recycle,
! 							  dp, dend);
  			dp++;				/* Do not do this ++ in the line above! */
  			/* The macro would do it four times - Jan.	*/
  		}
***************
*** 647,656 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 714,734 ----
  void
  pglz_decompress(const PGLZ_Header *source, char *dest)
  {
+ 	pglz_decompress_with_history(source, dest, NULL, 0);
+ }
+ 
+ void
+ pglz_decompress_with_history(const PGLZ_Header *source, char *dest,
+ 							 const char *history, int32 hlen)
+ {
  	const unsigned char *sp;
  	const unsigned char *srcend;
  	unsigned char *dp;
  	unsigned char *destend;
+ 	unsigned char *hend = NULL;
+ 
+ 	if (hlen > 0)
+ 		hend = (unsigned char *) history + hlen;
  
  	sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
  	srcend = ((const unsigned char *) source) + VARSIZE(source);
***************
*** 679,684 **** pglz_decompress(const PGLZ_Header *source, char *dest)
--- 757,763 ----
  				 */
  				int32		len;
  				int32		off;
+ 				int32		hoff;
  
  				len = (sp[0] & 0x0f) + 3;
  				off = ((sp[0] & 0xf0) << 4) | sp[1];
***************
*** 705,713 **** pglz_decompress(const PGLZ_Header *source, char *dest)
  				 * memcpy() here, because the copied areas could overlap
  				 * extremely!
  				 */
  				while (len--)
  				{
! 					*dp = dp[-off];
  					dp++;
  				}
  			}
--- 784,804 ----
  				 * memcpy() here, because the copied areas could overlap
  				 * extremely!
  				 */
+ 				hoff = off;
  				while (len--)
  				{
! 					if (NULL == hend)
!  						*dp = dp[-off];
! 					else
! 					{
! 						/*
! 						 * hoff provides the offset in the history buffer from
! 						 * the history end
! 						 */
! 						Assert(hoff < hlen);
! 						*dp = hend[-hoff];
! 						hoff--;
! 					}
  					dp++;
  				}
  			}
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
! 	bool		new_all_visible_cleared;		/* same for the page of newtid */
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
! #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
  
  /*
   * This is what we need to know about vacuum page cleanup/redirect
--- 142,157 ----
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	char		flags;
! 
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED		0x01
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED	0x02
! #define XL_HEAP_UPDATE_DELTA_ENCODED			0x04
! 
! #define SizeOfHeapUpdate	(offsetof(xl_heap_update, flags) + sizeof(char))
  
  /*
   * This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 107,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
--- 107,118 ----
   */
  extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			  const PGLZ_Strategy *strategy);
+ extern bool pglz_compress_with_history(const char *source, int32 slen,
+ 						   const char *history, int32 hlen,
+ 						   PGLZ_Header *dest,
+ 						   const PGLZ_Strategy *strategy);
  extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+ extern void pglz_decompress_with_history(const PGLZ_Header *source, char *dest,
+ 										 const char *history, int32 hlen);
  
  #endif   /* _PG_LZCOMPRESS_H_ */
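
For readers following the thread, the delta-encoding idea in the patch above can be illustrated with a much simpler toy encoder. This is only a sketch under stated assumptions: the tag format, the function names (toy_delta_encode, toy_delta_decode) and the brute-force match search are all hypothetical and are NOT the pglz format or the patch's code. What it shares with the patch is that matches are looked up only in the old tuple (the history), the offset is measured back from the end of the history buffer, and source bytes are never added to the history. It assumes the history is shorter than 65536 bytes and that the output buffer holds at least 2 * slen bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy format (hypothetical, not pglz). Each item is either
 *   0x00, byte                  -> one literal byte
 *   0x01, len, off_hi, off_lo   -> copy len bytes that start off bytes
 *                                  before the END of the history buffer
 */
static size_t
toy_delta_encode(const unsigned char *src, size_t slen,
                 const unsigned char *hist, size_t hlen,
                 unsigned char *out)
{
    size_t sp = 0;
    size_t op = 0;

    while (sp < slen)
    {
        size_t best_len = 0;
        size_t best_pos = 0;

        /* Brute-force search for the longest match, in the history only. */
        for (size_t h = 0; h < hlen; h++)
        {
            size_t l = 0;

            while (l < 255 && h + l < hlen && sp + l < slen &&
                   hist[h + l] == src[sp + l])
                l++;
            if (l > best_len)
            {
                best_len = l;
                best_pos = h;
            }
        }

        if (best_len >= 3)
        {
            size_t off = hlen - best_pos;   /* distance from history end */

            out[op++] = 0x01;
            out[op++] = (unsigned char) best_len;
            out[op++] = (unsigned char) (off >> 8);
            out[op++] = (unsigned char) (off & 0xff);
            sp += best_len;     /* source bytes are not added to history */
        }
        else
        {
            out[op++] = 0x00;
            out[op++] = src[sp++];
        }
    }
    return op;
}

/* Decoding needs the same history buffer that was used for encoding. */
static size_t
toy_delta_decode(const unsigned char *in, size_t inlen,
                 const unsigned char *hist, size_t hlen,
                 unsigned char *dest)
{
    const unsigned char *hend = hist + hlen;
    size_t ip = 0;
    size_t dp = 0;

    while (ip < inlen)
    {
        if (in[ip] == 0x00)
        {
            dest[dp++] = in[ip + 1];
            ip += 2;
        }
        else
        {
            size_t len = in[ip + 1];
            size_t off = ((size_t) in[ip + 2] << 8) | in[ip + 3];

            /* History and destination never overlap, so memcpy is safe. */
            memcpy(dest + dp, hend - off, len);
            dp += len;
            ip += 4;
        }
    }
    return dp;
}
```

Because matches only ever point into the separate history buffer, the decoder can use memcpy() here; the overlapping-copy concern noted in pglz_decompress applies only when a match refers back into the output itself.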

#15

Amit kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Amit kapila (#14)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Thursday, October 04, 2012 8:03 PM Heikki Linnakangas wrote:
On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:

On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:

On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
On 25.09.2012 18:27, Amit Kapila wrote:

If you feel it is a must to do the comparison, we can do it in the same
way as we identify for HOT?

Now I shall run the following tests and post the results here:
a. Attached patch in the mode where it takes advantage of the history tuple
b. By changing the logic for the modified-column calculation to use memcmp()
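
A memcmp()-based check for modified columns, as mentioned in the list above, could look roughly like the sketch below. It is an illustration only: the toy_column descriptor and toy_find_modified_columns are hypothetical names, and the sketch assumes fixed-width columns that sit at the same offsets in the old and new tuple, which real heap tuples with NULLs or variable-length attributes do not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical column descriptor: byte offset and length within a tuple. */
typedef struct
{
    size_t      off;
    size_t      len;
} toy_column;

/*
 * Sets modified[i] to true for every column whose bytes differ between
 * the old and the new tuple, and returns the number of modified columns.
 */
static int
toy_find_modified_columns(const char *oldtup, const char *newtup,
                          const toy_column *cols, int ncols,
                          bool *modified)
{
    int nmod = 0;

    for (int i = 0; i < ncols; i++)
    {
        modified[i] = memcmp(oldtup + cols[i].off,
                             newtup + cols[i].off,
                             cols[i].len) != 0;
        if (modified[i])
            nmod++;
    }
    return nmod;
}
```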

1. Please find the results (pgbench_test.htm) for point 2, where one column
gets a fixed update (the last few bytes are random) and the second column is
updated with a 32-byte random string. The runs for 50 and 100 are still in
progress; the others are attached to this mail.

Please find the readings of the LZ patch along with the Xlog-Scale patch.
The comparison, for Update operations, is between:
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

The readings were taken with the following configurations.
pgbench_xlog_scale_50 -
a. Updated Record size 50, Total Record size 1800
b. Threads 8, 1, 2
c. Synchronous_commit - off, on

pgbench_xlog_scale_250 -
a. Updated Record size 250, Total Record size 1800
b. Threads 8, 1, 2
c. Synchronous_commit - off, on

pgbench_xlog_scale_500 -
a. Updated Record size 500, Total Record size 1800
b. Threads 8, 1, 2
c. Synchronous_commit - off, on

Observations
--------------
a. There is still a good performance improvement even when the Update WAL optimization is applied on top of the Xlog Scaling Patch.
b. There is a slight performance dip for 1 thread (only with synchronous_commit = off) with the Update WAL optimization (LZ compression),
but for 2 threads there is a performance increase.

With Regards,
Amit Kapila.

#16

Amit kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Amit kapila (#15)

3 attachment(s)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Sorry, I forgot to attach the performance data. It is attached to this mail.

________________________________________
From: Amit kapila
Sent: Saturday, October 06, 2012 7:34 PM
To: 'Heikki Linnakangas'; noah@leadboat.com
Cc: pgsql-hackers@postgresql.org
Subject: RE: [HACKERS] Re: [WIP] Performance Improvement by reducing WAL for Update Operation


With Regards,
Amit Kapila.

#17

Amit kapila

amit.kapila@huawei.com

over 13 years ago

In reply to: Amit kapila (#15)

2 attachment(s)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Saturday, October 06, 2012 7:34 PM Amit kapila wrote:
On Thursday, October 04, 2012 8:03 PM Heikki Linnakangas wrote:
On Wednesday, October 03, 2012 9:33 PM Amit Kapila wrote:
On Friday, September 28, 2012 7:03 PM Amit Kapila wrote:

On Thursday, September 27, 2012 6:39 PM Amit Kapila wrote:

On Thursday, September 27, 2012 4:12 PM Heikki Linnakangas wrote:
On 25.09.2012 18:27, Amit Kapila wrote:

1. Please find the results (pgbench_test.htm) for point 2, where one column
gets a fixed update (the last few bytes are random) and the second column is
updated with a 32-byte random string. The runs for 50 and 100 are still in
progress; the others are attached to this mail.

Please find the readings of the LZ patch along with the Xlog-Scale patch.
The comparison, for Update operations, is between:
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

Please find attached the recovery performance data and an updated patch that
handles the unaligned access of PGLZ_Header during decompression by copying the header into a local, properly aligned variable.
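
The general technique behind that fix can be sketched in isolation as follows. This is a minimal illustration, not the patch's code: toy_header and toy_read_header are hypothetical stand-ins, and the real PGLZ_Header layout is not reproduced here. The point is only that memcpy() into a local struct avoids dereferencing a possibly misaligned pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a compressed-data header (not PGLZ_Header). */
typedef struct
{
    int32_t     total_size;
    int32_t     raw_size;
} toy_header;

/*
 * Read a header from a buffer position that may not be suitably aligned.
 * Casting p to (toy_header *) and dereferencing it could fault on
 * strict-alignment architectures; copying the bytes into a local,
 * properly aligned struct is safe on any platform.
 */
static toy_header
toy_read_header(const char *p)
{
    toy_header  h;

    memcpy(&h, p, sizeof(h));
    return h;
}
```

The cost is one small fixed-size copy per record, which compilers commonly turn into a few plain loads on platforms that allow unaligned access.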

Recovery Performance
----------------------------
1. The recovery performance is also better with the LZ compression patch.

Please let me know if any more data or tests are required for this patch.

With Regards,
Amit Kapila.

Attachments:

pgbench_recovery_benchmark.htm (text/html)

pglz_wal_update_v2.patch (application/octet-stream)

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 70,75 ****
--- 70,76 ----
  #include "utils/snapmgr.h"
  #include "utils/syscache.h"
  #include "utils/tqual.h"
+ #include "utils/pg_lzcompress.h"
  
  
  /* GUC variable */
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 86,92 ----
  					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ 				HeapTuple oldtup,
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
***************
*** 3195,3204 **** l2:
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 											 newbuf, heaptup,
! 											 all_visible_cleared,
! 											 all_visible_cleared_new);
  
  		if (newbuf != buffer)
  		{
--- 3197,3208 ----
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr;
! 
! 		recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 								 newbuf, heaptup, &oldtup,
! 								 all_visible_cleared,
! 								 all_visible_cleared_new);
  
  		if (newbuf != buffer)
  		{
***************
*** 4428,4434 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
   */
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! 				Buffer newbuf, HeapTuple newtup,
  				bool all_visible_cleared, bool new_all_visible_cleared)
  {
  	xl_heap_update xlrec;
--- 4432,4438 ----
   */
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! 				Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
  				bool all_visible_cleared, bool new_all_visible_cleared)
  {
  	xl_heap_update xlrec;
***************
*** 4437,4442 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4441,4456 ----
  	XLogRecPtr	recptr;
  	XLogRecData rdata[4];
  	Page		page = BufferGetPage(newbuf);
+ 	union
+ 	{
+ 		PGLZ_Header pglzheader;
+ 		char		buf[BLCKSZ];
+ 	}			buf;
+ 	char	   *newtupdata;
+ 	int			newtuplen;
+ 	char	   *oldtupdata;
+ 	int			oldtuplen;
+ 	bool		compressed = false;
  
  	/* Caller should not call me on a non-WAL-logged relation */
  	Assert(RelationNeedsWAL(reln));
***************
*** 4446,4456 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  	else
  		info = XLOG_HEAP_UPDATE;
  
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	xlrec.all_visible_cleared = all_visible_cleared;
  	xlrec.newtid = newtup->t_self;
! 	xlrec.new_all_visible_cleared = new_all_visible_cleared;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
--- 4460,4506 ----
  	else
  		info = XLOG_HEAP_UPDATE;
  
+ 	newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ 	newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ 	oldtupdata = ((char *) oldtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ 	oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ 
+ 	/* Is the update going to the same page? */
+ 	if (oldbuf == newbuf)
+ 	{
+ 		/*
+ 		 * enable this if you only want to compress the new tuple as is,
+ 		 * without taking advantage of the old tuple.
+ 		 */
+ #ifdef COMPRESS_ONLY
+ 		oldtuplen = 0;
+ #endif
+ 
+ 		if (PGLZ_MAX_OUTPUT(newtuplen) < sizeof(buf))
+ 		{
+ 			/* Delta-encode the new tuple using the old tuple */
+ 			if (pglz_compress_with_history(newtupdata, newtuplen,
+ 										   oldtupdata, oldtuplen,
+ 									(PGLZ_Header *) (char *) &buf.pglzheader,
+ 										   NULL))
+ 			{
+ 				compressed = true;
+ 				newtupdata = (char *) &buf.pglzheader;
+ 				newtuplen = VARSIZE(&buf.pglzheader);
+ 			}
+ 		}
+ 	}
+ 
+ 	xlrec.flags = 0;
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	if (all_visible_cleared)
! 		xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
  	xlrec.newtid = newtup->t_self;
! 	if (new_all_visible_cleared)
! 		xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! 	if (compressed)
! 		xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
***************
*** 4478,4485 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  	rdata[2].next = &(rdata[3]);
  
  	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! 	rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! 	rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
  	rdata[3].buffer = newbuf;
  	rdata[3].buffer_std = true;
  	rdata[3].next = NULL;
--- 4528,4535 ----
  	rdata[2].next = &(rdata[3]);
  
  	/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! 	rdata[3].data = newtupdata;
! 	rdata[3].len = newtuplen;
  	rdata[3].buffer = newbuf;
  	rdata[3].buffer_std = true;
  	rdata[3].next = NULL;
***************
*** 5232,5237 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5282,5289 ----
  	OffsetNumber offnum;
  	ItemId		lp = NULL;
  	HeapTupleHeader htup;
+ 	HeapTupleHeader oldtup = NULL;
+ 	uint32		old_tup_len = 0;
  	struct
  	{
  		HeapTupleHeaderData hdr;
***************
*** 5246,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5298,5304 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5289,5295 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
  		elog(PANIC, "heap_update_redo: invalid lp");
  
! 	htup = (HeapTupleHeader) PageGetItem(page, lp);
  
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
--- 5341,5348 ----
  	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
  		elog(PANIC, "heap_update_redo: invalid lp");
  
! 	oldtup = htup = (HeapTupleHeader) PageGetItem(page, lp);
! 	old_tup_len = ItemIdGetLength(lp);
  
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
***************
*** 5308,5314 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->all_visible_cleared)
  		PageClearAllVisible(page);
  
  	/*
--- 5361,5367 ----
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
  		PageClearAllVisible(page);
  
  	/*
***************
*** 5330,5336 **** newt:;
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->new_all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5383,5389 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5380,5395 **** newsame:;
  	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
  
  	newlen = record->xl_len - hsize;
  	Assert(newlen <= MaxHeapTupleSize);
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
  	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 	/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 	memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 		   (char *) xlrec + hsize,
! 		   newlen);
  	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
--- 5433,5467 ----
  	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
  
  	newlen = record->xl_len - hsize;
+ 
  	Assert(newlen <= MaxHeapTupleSize);
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
  	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 
! 	/*
! 	 * If the new tuple was delta-encoded, decode it.
! 	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! 	{
! 		char	   *encoded_data = (((char *) xlrec) + hsize);
! 
! 		/* XXX: also add some sanity checks with PGLZ_RAW_SIZE here.*/
! 		pglz_decompress_with_history(encoded_data,
! 					 ((char *) htup) + offsetof(HeapTupleHeaderData, t_bits),
! 									 &newlen,
! 				   ((char *) oldtup) + offsetof(HeapTupleHeaderData, t_bits),
! 						old_tup_len - offsetof(HeapTupleHeaderData, t_bits));
! 	}
! 	else
! 	{
! 		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 			   (char *) xlrec + hsize,
! 			   newlen);
! 	}
  	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
***************
*** 5404,5410 **** newsame:;
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
! 	if (xlrec->new_all_visible_cleared)
  		PageClearAllVisible(page);
  
  	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
--- 5476,5482 ----
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
! 	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
  		PageClearAllVisible(page);
  
  	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 373,379 **** do { \
   */
  static inline int
  pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
! 				int *lenp, int *offp, int good_match, int good_drop)
  {
  	PGLZ_HistEntry *hent;
  	int32		len = 0;
--- 373,380 ----
   */
  static inline int
  pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
! 				const char *hend, int *lenp, int *offp, int good_match,
! 				int good_drop)
  {
  	PGLZ_HistEntry *hent;
  	int32		len = 0;
***************
*** 391,399 **** pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
  		int32		thislen;
  
  		/*
  		 * Stop if the offset does not fit into our tag anymore.
  		 */
- 		thisoff = ip - hp;
  		if (thisoff >= 0x0fff)
  			break;
  
--- 392,408 ----
  		int32		thislen;
  
  		/*
+ 		 * If a history buffer is present, calculate the offset from the
+ 		 * end of the history instead of from the input position.
+ 		 */
+ 		if (NULL == hend)
+ 			thisoff = ip - hp;
+ 		else
+ 			thisoff = hend - hp;
+ 
+ 		/*
  		 * Stop if the offset does not fit into our tag anymore.
  		 */
  		if (thisoff >= 0x0fff)
  			break;
  
***************
*** 482,487 **** bool
--- 491,510 ----
  pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			  const PGLZ_Strategy *strategy)
  {
+ 	return pglz_compress_with_history(source, slen, NULL, 0, dest, strategy);
+ }
+ 
+ /*
+  * Like pglz_compress, but uses another piece of data to initialize the
+  * history table. When decompressing, you must pass the same history data
+  * to pglz_decompress_with_history(). This makes it possible to do simple
+  * delta compression.
+  */
+ bool
+ pglz_compress_with_history(const char *source, int32 slen,
+ 						   const char *history, int32 hlen,
+ 						   PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+ {
  	unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
  	unsigned char *bstart = bp;
  	int			hist_next = 0;
***************
*** 500,505 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 523,530 ----
  	int32		result_size;
  	int32		result_max;
  	int32		need_rate;
+ 	const char *hp = NULL;
+ 	const char *hend = NULL;
  
  	/*
  	 * Our fallback strategy is the default.
***************
*** 560,565 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 585,608 ----
  	 * hist_entries[] array; its entries are initialized as they are used.
  	 */
  	memset(hist_start, 0, sizeof(hist_start));
+ 	if (hlen > 0)
+ 	{
+ 		hp = history;
+ 		hend = history + hlen;
+ 		while (hp < hend)
+ 		{
+ 			/*
+ 			 * XXX: I think this doesn't handle the last few bytes of the
+ 			 * history correctly, or at least not in the most efficient way.
+ 			 * Logically, we should behave like the history and the source
+ 			 * strings are concatenated, but we use 'hend' here.
+ 			 */
+ 			pglz_hist_add(hist_start, hist_entries,
+ 						  hist_next, hist_recycle,
+ 						  hp, hend);
+ 			hp++;				/* Do not do this ++ in the line above! */
+ 		}
+ 	}
  
  	/*
  	 * Compress the source directly into the output buffer.
***************
*** 588,594 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  		/*
  		 * Try to find a match in the history
  		 */
! 		if (pglz_find_match(hist_start, dp, dend, &match_len,
  							&match_off, good_match, good_drop))
  		{
  			/*
--- 631,637 ----
  		/*
  		 * Try to find a match in the history
  		 */
! 		if (pglz_find_match(hist_start, dp, dend, hend, &match_len,
  							&match_off, good_match, good_drop))
  		{
  			/*
***************
*** 596,608 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			 * characters.
  			 */
  			pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
! 			while (match_len--)
  			{
! 				pglz_hist_add(hist_start, hist_entries,
! 							  hist_next, hist_recycle,
! 							  dp, dend);
! 				dp++;			/* Do not do this ++ in the line above! */
! 				/* The macro would do it four times - Jan.	*/
  			}
  			found_match = true;
  		}
--- 639,668 ----
  			 * characters.
  			 */
  			pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
! 
! 			/*
! 			 * If the history is passed as a separate buffer, don't add source
! 			 * data to the history. This is required because match offsets are
! 			 * calculated relative to the history buffer.
! 			 */
! 			if (NULL == hend)
  			{
! 				while (match_len--)
! 				{
! 					pglz_hist_add(hist_start, hist_entries,
! 								  hist_next, hist_recycle,
! 								  dp, dend);
! 					dp++;		/* Do not do this ++ in the line above! */
! 					/* The macro would do it four times - Jan.	*/
! 				}
! 			}
! 			else
! 			{
! 				/*
! 				 * Advance the source pointer by the match length directly,
! 				 * because source data is not added to the history.
! 				 */
! 				dp += match_len;
  			}
  			found_match = true;
  		}
***************
*** 612,620 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			 * No match found. Copy one literal byte.
  			 */
  			pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
! 			pglz_hist_add(hist_start, hist_entries,
! 						  hist_next, hist_recycle,
! 						  dp, dend);
  			dp++;				/* Do not do this ++ in the line above! */
  			/* The macro would do it four times - Jan.	*/
  		}
--- 672,686 ----
  			 * No match found. Copy one literal byte.
  			 */
  			pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
! 
! 			/*
! 			 * If the history is passed as a separate buffer, don't add any
! 			 * unmatched input data to the history.
! 			 */
! 			if (NULL == hend)
! 				pglz_hist_add(hist_start, hist_entries,
! 							  hist_next, hist_recycle,
! 							  dp, dend);
  			dp++;				/* Do not do this ++ in the line above! */
  			/* The macro would do it four times - Jan.	*/
  		}
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  void
  pglz_decompress(const PGLZ_Header *source, char *dest)
  {
  	const unsigned char *sp;
  	const unsigned char *srcend;
  	unsigned char *dp;
  	unsigned char *destend;
  
  	sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! 	srcend = ((const unsigned char *) source) + VARSIZE(source);
  	dp = (unsigned char *) dest;
! 	destend = dp + source->rawsize;
  
  	while (sp < srcend && dp < destend)
  	{
--- 713,753 ----
  void
  pglz_decompress(const PGLZ_Header *source, char *dest)
  {
+ 	pglz_decompress_with_history((const char *) source, dest, NULL, NULL, 0);
+ }
+ 
+ /* ----------
+  * pglz_decompress_with_history -
+  *
+  *		Decompresses source into dest by using the specified history.
+  * ----------
+  */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ 							 const char *history, int32 hlen)
+ {
+ 	PGLZ_Header src;
  	const unsigned char *sp;
  	const unsigned char *srcend;
  	unsigned char *dp;
  	unsigned char *destend;
+ 	unsigned char *hend = NULL;
+ 
+ 	/* Copy the header into a local to avoid unaligned access */
+ 	memcpy((char *) &src, source, sizeof(PGLZ_Header));
+ 
+ 	if (hlen > 0)
+ 		hend = (unsigned char *) history + hlen;
  
  	sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! 	srcend = ((const unsigned char *) source) + VARSIZE(&src);
  	dp = (unsigned char *) dest;
! 	destend = dp + src.rawsize;
! 
! 	if (destlen)
! 	{
! 		*destlen = src.rawsize;
! 	}
  
  	while (sp < srcend && dp < destend)
  	{
***************
*** 679,684 **** pglz_decompress(const PGLZ_Header *source, char *dest)
--- 771,777 ----
  				 */
  				int32		len;
  				int32		off;
+ 				int32		hoff;
  
  				len = (sp[0] & 0x0f) + 3;
  				off = ((sp[0] & 0xf0) << 4) | sp[1];
***************
*** 705,713 **** pglz_decompress(const PGLZ_Header *source, char *dest)
  				 * memcpy() here, because the copied areas could overlap
  				 * extremely!
  				 */
  				while (len--)
  				{
! 					*dp = dp[-off];
  					dp++;
  				}
  			}
--- 798,818 ----
  				 * memcpy() here, because the copied areas could overlap
  				 * extremely!
  				 */
+ 				hoff = off;
  				while (len--)
  				{
! 					if (NULL == hend)
! 						*dp = dp[-off];
! 					else
! 					{
! 						/*
! 						 * hoff provides the offset in the history buffer from
! 						 * the history end
! 						 */
! 						Assert(hoff > 0 && hoff <= hlen);
! 						*dp = hend[-hoff];
! 						hoff--;
! 					}
  					dp++;
  				}
  			}
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
! 	bool		new_all_visible_cleared;		/* same for the page of newtid */
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
! #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
  
  /*
   * This is what we need to know about vacuum page cleanup/redirect
--- 142,157 ----
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	char		flags;
! 
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED		0x01
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED	0x02
! #define XL_HEAP_UPDATE_DELTA_ENCODED			0x04
! 
! #define SizeOfHeapUpdate	(offsetof(xl_heap_update, flags) + sizeof(char))
  
  /*
   * This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 107,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
--- 107,118 ----
   */
  extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
  			  const PGLZ_Strategy *strategy);
+ extern bool pglz_compress_with_history(const char *source, int32 slen,
+ 						   const char *history, int32 hlen,
+ 						   PGLZ_Header *dest,
+ 						   const PGLZ_Strategy *strategy);
  extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+ extern void pglz_decompress_with_history(const char *source, char *dest,
+ 						   uint32 *destlen, const char *history, int32 hlen);
  
  #endif   /* _PG_LZCOMPRESS_H_ */
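
As a reading aid for the pglz_decompress_with_history hunk above, here is a minimal standalone sketch of how a match tag is resolved against a separate history buffer: the offset is counted backwards from the end of the history, and each copied byte moves one position closer to that end. The helper name copy_from_history and its bounds checks are assumptions made for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the patch: copy "len" bytes into dp,
 * starting "off" bytes before the end of the history buffer.  Returns the
 * advanced destination pointer.
 */
static unsigned char *
copy_from_history(unsigned char *dp, const unsigned char *history,
				  int hlen, int off, int len)
{
	const unsigned char *hend = history + hlen;
	int			hoff = off;

	/* Assumed bounds: the match must lie entirely inside the history. */
	assert(off > 0 && off <= hlen);
	assert(len <= off);

	while (len--)
	{
		*dp++ = hend[-hoff];	/* read relative to the history end */
		hoff--;					/* step towards the history end */
	}
	return dp;
}
```

With an 8-byte history "abcdefgh", off = 5 and len = 3, the helper writes "def" to the destination. Unlike the self-referencing dp[-off] copy in plain pglz_decompress, the source and destination here never overlap, which is why the sketch can require len <= off.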

#18

Alvaro Herrera

alvherre@2ndquadrant.com

about 13 years ago

In reply to: Amit kapila (#1)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Amit kapila wrote:

Rebased version of patch based on latest code.

Uhm, how can this patch change a caller of PageAddItem() by adding one
more argument, yet not touch bufpage.c at all? Are you sure this
compiles?

The email subject has a WIP tag; is that still the patch status? If so,
I assume it's okay to mark this Returned with Feedback and expect a
later version to be posted.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19

Amit kapila

amit.kapila@huawei.com

about 13 years ago

In reply to: Alvaro Herrera (#18)

Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Wednesday, October 24, 2012 12:15 AM Alvaro Herrera wrote:

Amit kapila wrote:

Rebased version of patch based on latest code.

Uhm, how can this patch change a caller of PageAddItem() by adding one
more argument, yet not touch bufpage.c at all? Are you sure this
compiles?

It compiles; I have confirmed this against the latest HEAD as well.
Could you please point out anything you think is done wrong in the patch?

The email subject has a WIP tag; is that still the patch status? If so,
I assume it's okay to mark this Returned with Feedback and expect a
later version to be posted.

The WIP tag comes from the original mail thread. The current status is as follows:
I have updated the patch with all bug fixes, and the performance results have been posted. Noah has also collected performance data.
He believes there is a discrepancy in the performance data, but in my view the cause is simply the way I presented the data.

Currently there is no clear feedback that I can act on, so I would be very thankful if you could wait for the discussion to reach a conclusion.

With Regards,
Amit Kapila.