Re: Performance Improvement by reducing WAL for Update Operation
On Mon, 29 Oct 2012 20:02:11 +0530 Amit Kapila wrote:
On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:
One idea is to use the LZ format in the WAL record, but use your
memcmp() code to construct it. I believe the slow part in LZ compression
is in trying to locate matches in the "history", so if you just replace
that with your code that's aware of the column boundaries and uses
simple memcmp() to detect what parts changed, you could create LZ
compressed output just as quickly as the custom encoded format. It would
leave the door open for making the encoding smarter or to do actual
compression in the future, without changing the format and the code to
decode it.
This is a good idea. I shall try it.
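For illustration, here is a standalone toy sketch of that idea (not the attached patch code; the fixed-width column layout and the printed output format are made up for the example). Matches are found with a plain memcmp() over known column boundaries, and unchanged columns become LZ-style history tags while changed columns become literal data. The real patch emits the pglz control-byte format instead of printing.

#include <stdio.h>
#include <string.h>

typedef struct { int off; int len; } Column;   /* toy fixed-width column layout */

/* Emit history tags for unchanged columns, literals for changed ones. */
static void
delta_encode(const char *oldtup, const char *newtup,
             const Column *cols, int ncols)
{
    int     i;

    for (i = 0; i < ncols; i++)
    {
        const char *o = oldtup + cols[i].off;
        const char *n = newtup + cols[i].off;

        if (memcmp(o, n, cols[i].len) == 0)
            printf("TAG off=%d len=%d\n", cols[i].off, cols[i].len);
        else
            printf("ADD len=%d data=%.*s\n", cols[i].len, cols[i].len, n);
    }
}

int
main(void)
{
    Column  cols[3] = {{0, 4}, {4, 4}, {8, 4}};

    /* Only the middle column changed, so it is the only literal emitted. */
    delta_encode("AAAABBBBCCCC", "AAAAXXXXCCCC", cols, 3);
    return 0;
}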
In the existing algorithm, storing new data that is not present in the
history needs 1 control byte for every 8 bytes of new data, which can
increase the size of the compressed output compared to our delta encoding
approach.
Approach-2
---------------
Use only one bit of control data per operation [0 - length and new data, 1 - pick from
history based on OFFSET-LENGTH].
The modified bit value (0) handles the new field data as a continuous
stream of data, instead of treating every byte as new data.
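For illustration, a minimal standalone decoder sketch of this one-bit control scheme (a toy only; it assumes a simplified one-byte offset and one-byte length per tag, not the 2-4 byte pglz tag format used in the attached patches):

#include <stdio.h>
#include <string.h>

/*
 * Decode 'src' into 'dst', using 'history' for back-references.
 * Control bit 1: copy OFFSET/LENGTH bytes from the history (old) tuple.
 * Control bit 0: a length byte followed by that many bytes of new data.
 */
static int
decode(const unsigned char *src, int srclen,
       const unsigned char *history, unsigned char *dst)
{
    int     sp = 0, dp = 0;

    while (sp < srclen)
    {
        unsigned char ctrl = src[sp++];
        int     bit;

        for (bit = 0; bit < 8 && sp < srclen; bit++, ctrl >>= 1)
        {
            if (ctrl & 1)
            {
                int     off = src[sp++];        /* history reference */
                int     len = src[sp++];

                memcpy(dst + dp, history + off, len);
                dp += len;
            }
            else
            {
                int     len = src[sp++];        /* run of new data */

                memcpy(dst + dp, src + sp, len);
                sp += len;
                dp += len;
            }
        }
    }
    return dp;                                  /* bytes produced */
}

int
main(void)
{
    const unsigned char history[] = "AAAABBBBCCCC";
    /* ctrl 0x05: history(0,4), add "XXXX", history(8,4) */
    const unsigned char enc[] = {0x05, 0, 4, 4, 'X', 'X', 'X', 'X', 8, 4};
    unsigned char out[64];
    int     n = decode(enc, sizeof(enc), history, out);

    printf("%.*s\n", n, (char *) out);          /* prints AAAAXXXXCCCC */
    return 0;
}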
Attached are the patches:
1. wal_update_changes_lz_v4 - uses the LZ approach with memcmp to construct the WAL record
2. wal_update_changes_modified_lz_v5 - uses the modified LZ approach described above as Approach-2
The main changes compared to the previous patch are as follows:
1. In heap_delta_encode, use LZ encoding instead of the custom encoding.
2. Instead of get_tup_info(), introduced the heap_getattr_with_len() macro, based on a suggestion from Noah.
3. LZ macros moved from .c to .h, as they need to be used for encoding.
4. Changed the function argument format for heap_delta_encode()/heap_delta_decode(), based on a suggestion from Noah.
Performance Data:
Results:
                               1 thread           2 threads          4 threads          8 threads
Patch                          Tps  wal size(GB)  Tps   wal size(GB) Tps   wal size(GB) Tps   wal size(GB)
Xlog scale                     861  4.36          1463  7.33         2135  10.74        2689  13.56
Xlog scale + Original LZ       892  2.46          1685  3.35         3232  6.02         5296  9.20
Xlog scale + Modified LZ       852  2.35          1664  3.25         3229  5.71         5431  8.68
These are still WIP patches. Some cleanup has to be done.
Apart from that, I think the reason the performance is still not the same as the custom delta encoding approach is that the custom approach has an IGN command, so no commands are needed for the unchanged data at the end and decode can form the tuple from the old tuple.
I shall write wal_update_changes_custom_delta_v6, and then we can compare the performance data of all three patches and decide which one to go with based on the results.
Suggestions/Comments?
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_lz_v4.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc,
! int32 *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,391 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 512,534 ----
}
}
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ int32 len;
+
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, &len);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1013 ----
}
/*
+ * Check if the specified attribute's value is same in both given tuples.
+ * Subroutine for HeapSatisfiesHOTUpdate.
+ */
+ bool
+ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
+ value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms encoded data from the old and new tuples, with the modified columns,
+ * using an algorithm similar to the LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the encoded data using lz algorithm.
+ *
+ * Encode the bitmap [+padding] [+oid] as new data, then loop over all
+ * attributes to find any modified attributes.
+ *
+ * Unmodified data is encoded as a history tag in the output and
+ * modified data is encoded as new data in the output.
+ *
+ * The output is used only if the encoded data is less than 75% of
+ * the original data; otherwise the encoding is abandoned.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_off = 0,
+ old_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_pad_len,
+ new_pad_len;
+ bool match_not_found = false,
+ isnull;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_bitmaplen,
+ new_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ old_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum encoded data is of 75% of total size. The max tuple size
+ * is already validated as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length needed for the
+ * corresponding operation.
+ */
+ if ((bp + (2 * new_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_tuple_attr_equals(tupleDesc, attnum, oldtup, newtup))
+ {
+ match_not_found = true;
+ data_len = old_off - match_off;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length needed for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap len needs to be added to match_off to get the
+ * actual start offset in the old/history tuple.
+ */
+ match_off += old_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, encode
+ * it as a copy from the history tuple, with len
+ * and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ }
+
+ /* calculate the old tuple field length which needs to be ignored */
+ heap_getattr_with_len(oldtup, attnum, tupleDesc, &isnull, &len);
+ old_off += len;
+
+ heap_getattr_with_len(newtup, attnum, tupleDesc, &isnull, &len);
+ new_off += len;
+
+ match_off = old_off;
+ }
+ else
+ {
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length needed for the
+ * corresponding operation.
+ */
+ data_len = new_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * calculate the old tuple field start position, required to
+ * skip any alignment padding that is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_off;
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+
+ old_pad_len = old_off - temp_off;
+
+ /*
+ * calculate the new tuple field start position to check whether
+ * any padding is required or not because of field alignment.
+ */
+ temp_off = new_off;
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ new_pad_len = new_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_pad_len != new_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and new
+ * tuples and the last attribute value of the new tuple is the
+ * same as in the old tuple, then encode the data up to the
+ * current match as history data.
+ *
+ * If the last attribute value of the new tuple is not the same
+ * as in the old tuple, then marking the matched data as history
+ * has already been taken care of.
+ */
+ if (!match_not_found)
+ {
+ /*
+ * Check whether the output buffer would reach result_max
+ * after advancing it by the approximate length needed for
+ * the corresponding operation.
+ */
+ data_len = old_off - old_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+ }
+
+ match_off = old_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_pad_len, dp);
+ }
+ }
+
+ heap_getattr_with_len(oldtup, attnum, tupleDesc, &isnull, &len);
+ old_off += len;
+
+ heap_getattr_with_len(newtup, attnum, tupleDesc, &isnull, &len);
+ new_off += len;
+
+ change_off = new_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which is
+ * used while copying the modified data.
+ */
+ dp = dstart + new_off;
+ match_not_found = false;
+ }
+ }
+
+ /* If any modified column data is present then copy it. */
+ data_len = new_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data is present then copy it as history */
+ data_len = old_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *)encdata,
+ (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *)oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,851 **** heapgettup_pagemode(HeapScanDesc scan,
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull)
{
return (
(attnum) > 0 ?
--- 845,852 ----
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull, int32 *len)
{
return (
(attnum) > 0 ?
***************
*** 855,877 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
(Datum) NULL
)
:
(
! nocachegetattr((tup), (attnum), (tupleDesc))
)
)
)
--- 856,882 ----
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
! (char *)(tup)->t_data + (tup)->t_data->t_hoff +
! (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
+ (*(len) = 0),
(Datum) NULL
)
:
(
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), len)
)
)
)
***************
*** 881,886 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
--- 886,903 ----
)
);
}
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
+ fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull)
+ {
+ int32 len;
+
+ return fastgetattr_with_len(tup, attnum, tupleDesc, isnull, &len);
+ }
#endif /* defined(DISABLE_COMPLEX_MACRO) */
***************
*** 3200,3209 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3217,3228 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3263,3346 **** l2:
}
/*
- * Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTUpdate.
- */
- static bool
- heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
- HeapTuple tup1, HeapTuple tup2)
- {
- Datum value1,
- value2;
- bool isnull1,
- isnull2;
- Form_pg_attribute att;
-
- /*
- * If it's a whole-tuple reference, say "not equal". It's not really
- * worth supporting this case, since it could only succeed after a no-op
- * update, which is hardly a case worth optimizing for.
- */
- if (attrnum == 0)
- return false;
-
- /*
- * Likewise, automatically say "not equal" for any system attribute other
- * than OID and tableOID; we cannot expect these to be consistent in a HOT
- * chain, or even to be set correctly yet in the new tuple.
- */
- if (attrnum < 0)
- {
- if (attrnum != ObjectIdAttributeNumber &&
- attrnum != TableOidAttributeNumber)
- return false;
- }
-
- /*
- * Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
- * single heap_deform_tuple call on each tuple, instead? But that doesn't
- * work for system columns ...
- */
- value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
- value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
-
- /*
- * If one value is NULL and other is not, then they are certainly not
- * equal
- */
- if (isnull1 != isnull2)
- return false;
-
- /*
- * If both are NULL, they can be considered equal.
- */
- if (isnull1)
- return true;
-
- /*
- * We do simple binary comparison of the two datums. This may be overly
- * strict because there can be multiple binary representations for the
- * same logical value. But we should be OK as long as there are no false
- * positives. Using a type-specific equality operator is messy because
- * there could be multiple notions of equality in different operator
- * classes; furthermore, we cannot safely invoke user-defined functions
- * while holding exclusive buffer lock.
- */
- if (attrnum <= 0)
- {
- /* The only allowed system columns are OIDs, so do this */
- return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
- }
- else
- {
- Assert(attrnum <= tupdesc->natts);
- att = tupdesc->attrs[attrnum - 1];
- return datumIsEqual(value1, value2, att->attbyval, att->attlen);
- }
- }
-
- /*
* Check if the old and new tuples represent a HOT-safe update. To be able
* to do a HOT update, we must not have changed any columns used in index
* definitions.
--- 3282,3287 ----
***************
*** 4435,4441 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4376,4382 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4444,4449 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4385,4401 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4453,4463 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4405,4445 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can only hold history offsets in the range 1 - 4095,
+ * so delta encoding is not attempted for tuples whose length exceeds
+ * PGLZ_HISTORY_SIZE.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *)&buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4484,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4466,4480 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * -----------
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
! * OR
! * PG93FORMAT [If encoded]: LZ header + Encoded data
! * -----------
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5238,5244 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5226,5235 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5253,5259 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5244,5250 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5296,5302 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5287,5293 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5315,5321 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5306,5312 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5337,5343 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5328,5334 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5393,5402 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5384,5412 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *)(((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5411,5417 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5421,5427 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,714 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
--- 658,685 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates the old page's
! all-visible bit was cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates the new page's
! all-visible bit was cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, int32 *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,713 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,188 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * Splits the backward/history reference into separate chunks
+ * if the given length is more than the max match, and repeats the process
+ * until the given length is processed.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs literal bytes to the destination buffer, including the
+ * appropriate control bits, for the given input length.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _total_len = (_len); \
+ while (_total_len-- > 0) \
+ { \
+ pglz_out_literal(_ctrlp, _ctrlb, _ctrl, _buf, *(_byte)); \
+ (_byte) = (char *)(_byte) + 1; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 205,210 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
wal_update_changes_modified_lz_v5.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc,
! int32 *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,391 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 512,534 ----
}
}
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ int32 len;
+
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, &len);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1013 ----
}
/*
+ * Check if the specified attribute's value is same in both given tuples.
+ * Subroutine for HeapSatisfiesHOTUpdate.
+ */
+ bool
+ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
+ value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms encoded data from the old and new tuples, with the modified columns,
+ * using an algorithm similar to the LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the encoded data using lz algorithm.
+ *
+ * Encode the bitmap [+padding] [+oid] as new data, then loop over all
+ * attributes to find any modified attributes.
+ *
+ * Unmodified data is encoded as a history tag in the output and
+ * modified data is encoded as new data in the output.
+ *
+ * The output is used only if the encoded data is less than 75% of
+ * the original data; otherwise the encoding is abandoned.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_off = 0,
+ old_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_pad_len,
+ new_pad_len;
+ bool match_not_found = false,
+ isnull;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_bitmaplen,
+ new_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ old_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum encoded data is of 75% of total size. The max tuple size
+ * is already validated as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length needed for the
+ * corresponding operation.
+ */
+ if ((bp + 2 + new_bitmaplen) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_tuple_attr_equals(tupleDesc, attnum, oldtup, newtup))
+ {
+ match_not_found = true;
+ data_len = old_off - match_off;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length needed for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap len needs to be added to match_off to get the
+ * actual start offset in the old/history tuple.
+ */
+ match_off += old_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, encode
+ * it as a copy from the history tuple, with len
+ * and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ }
+
+ /* calculate the old tuple field length which needs to be ignored */
+ heap_getattr_with_len(oldtup, attnum, tupleDesc, &isnull, &len);
+ old_off += len;
+
+ heap_getattr_with_len(newtup, attnum, tupleDesc, &isnull, &len);
+ new_off += len;
+
+ match_off = old_off;
+ }
+ else
+ {
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length needed for the
+ * corresponding operation.
+ */
+ data_len = new_off - change_off;
+ if ((bp + data_len + 2) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * calculate the old tuple field start position, required to
+ * skip any alignment padding that is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_off;
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+
+ old_pad_len = old_off - temp_off;
+
+ /*
+ * calculate the new tuple field start position to check whether
+ * any padding is required or not because of field alignment.
+ */
+ temp_off = new_off;
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ new_pad_len = new_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_pad_len != new_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and new
+ * tuples and the last attribute value of the new tuple is the
+ * same as in the old tuple, then encode the data up to the
+ * current match as history data.
+ *
+ * If the last attribute value of the new tuple is not the same
+ * as in the old tuple, then marking the matched data as history
+ * has already been taken care of.
+ */
+ if (!match_not_found)
+ {
+ /*
+ * Check whether the output buffer would reach result_max
+ * after advancing it by the approximate length needed for
+ * the corresponding operation.
+ */
+ data_len = old_off - old_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+ }
+
+ match_off = old_off;
+
+ /* Alignment data */
+ if ((bp + new_pad_len + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_pad_len, dp);
+ }
+ }
+
+ heap_getattr_with_len(oldtup, attnum, tupleDesc, &isnull, &len);
+ old_off += len;
+
+ heap_getattr_with_len(newtup, attnum, tupleDesc, &isnull, &len);
+ new_off += len;
+
+ change_off = new_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which is
+ * used while copying the modified data.
+ */
+ dp = dstart + new_off;
+ match_not_found = false;
+ }
+ }
+
+ /* If any modified column data is present then copy it. */
+ data_len = new_off - change_off;
+ if ((bp + data_len + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data is present then copy it as history */
+ data_len = old_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *)encdata,
+ (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *)oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,851 **** heapgettup_pagemode(HeapScanDesc scan,
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull)
{
return (
(attnum) > 0 ?
--- 845,852 ----
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull, int32 *len)
{
return (
(attnum) > 0 ?
***************
*** 855,877 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
(Datum) NULL
)
:
(
! nocachegetattr((tup), (attnum), (tupleDesc))
)
)
)
--- 856,882 ----
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
! (char *)(tup)->t_data + (tup)->t_data->t_hoff +
! (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
+ (*(len) = 0),
(Datum) NULL
)
:
(
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), len)
)
)
)
***************
*** 881,886 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
--- 886,903 ----
)
);
}
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
+ fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull)
+ {
+ int32 len;
+
+ return fastgetattr_with_len(tup, attnum, tupleDesc, isnull, &len);
+ }
#endif /* defined(DISABLE_COMPLEX_MACRO) */
***************
*** 3200,3209 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3217,3228 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3263,3346 **** l2:
}
/*
- * Check if the specified attribute's value is same in both given tuples.
- * Subroutine for HeapSatisfiesHOTUpdate.
- */
- static bool
- heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
- HeapTuple tup1, HeapTuple tup2)
- {
- Datum value1,
- value2;
- bool isnull1,
- isnull2;
- Form_pg_attribute att;
-
- /*
- * If it's a whole-tuple reference, say "not equal". It's not really
- * worth supporting this case, since it could only succeed after a no-op
- * update, which is hardly a case worth optimizing for.
- */
- if (attrnum == 0)
- return false;
-
- /*
- * Likewise, automatically say "not equal" for any system attribute other
- * than OID and tableOID; we cannot expect these to be consistent in a HOT
- * chain, or even to be set correctly yet in the new tuple.
- */
- if (attrnum < 0)
- {
- if (attrnum != ObjectIdAttributeNumber &&
- attrnum != TableOidAttributeNumber)
- return false;
- }
-
- /*
- * Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
- * single heap_deform_tuple call on each tuple, instead? But that doesn't
- * work for system columns ...
- */
- value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
- value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
-
- /*
- * If one value is NULL and other is not, then they are certainly not
- * equal
- */
- if (isnull1 != isnull2)
- return false;
-
- /*
- * If both are NULL, they can be considered equal.
- */
- if (isnull1)
- return true;
-
- /*
- * We do simple binary comparison of the two datums. This may be overly
- * strict because there can be multiple binary representations for the
- * same logical value. But we should be OK as long as there are no false
- * positives. Using a type-specific equality operator is messy because
- * there could be multiple notions of equality in different operator
- * classes; furthermore, we cannot safely invoke user-defined functions
- * while holding exclusive buffer lock.
- */
- if (attrnum <= 0)
- {
- /* The only allowed system columns are OIDs, so do this */
- return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
- }
- else
- {
- Assert(attrnum <= tupdesc->natts);
- att = tupdesc->attrs[attrnum - 1];
- return datumIsEqual(value1, value2, att->attbyval, att->attlen);
- }
- }
-
- /*
* Check if the old and new tuples represent a HOT-safe update. To be able
* to do a HOT update, we must not have changed any columns used in index
* definitions.
--- 3282,3287 ----
***************
*** 4435,4441 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4376,4382 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4444,4449 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4385,4401 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4453,4463 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4405,4445 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can hold a history offset only in the range of
+ * 1 - 4095, so delta encoding is not applied to tuples whose length
+ * is PGLZ_HISTORY_SIZE or more.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *)&buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4484,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4466,4480 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * -----------
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
! * OR
! * PG93FORMAT [If encoded]: LZ header + Encoded data
! * -----------
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5238,5244 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5226,5235 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5253,5259 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5244,5250 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5296,5302 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5287,5293 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5315,5321 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5306,5312 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5337,5343 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5328,5334 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5393,5402 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5384,5412 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *)(((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5411,5417 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5421,5427 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,735 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * Otherwise it contains the match length minus 3 and the
! * upper 4 bits of the offset. The next following byte
! * contains the lower 8 bits of the offset. If the length is
! * coded as 18, another extension tag byte tells how much
! * longer the match really was (0-255).
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! /*
! * Check for output buffer overrun, to ensure we don't clobber
! * memory in case of corrupt input. Note: we must advance dp
! * here to ensure the error is detected below the loop. We
! * don't simply put the elog inside the loop since that will
! * probably interfere with optimization.
! */
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source to
! * OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, int32 *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,713 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,196 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * Splits the backward/history reference into separate chunks if the
+ * given length is more than the max match, and repeats the process
+ * until the entire length has been processed.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _mlen; \
+ int32 _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > 255 ? 255 : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_mlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _mlen); \
+ (_buf) += _mlen; \
+ (_byte) += _mlen; \
+ _total_len -= _mlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 213,218 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
On Thu, 8 Nov 2012 17:33:54 +0000 Amit Kapila wrote:
On Mon, 29 Oct 2012 20:02:11 +0530 Amit Kapila wrote:
On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:
One idea is to use the LZ format in the WAL record, but use your
memcmp() code to construct it. I believe the slow part in LZ compression
is in trying to locate matches in the "history", so if you just replace
that with your code that's aware of the column boundaries and uses
simple memcmp() to detect what parts changed, you could create LZ
compressed output just as quickly as the custom encoded format. It would
leave the door open for making the encoding smarter or to do actual
compression in the future, without changing the format and the code to
decode it.
This is good idea. I shall try it.
In the existing algorithm for storing the new data which is not present in
the history, it needs 1 control byte for
every 8 bytes of new data which can increase the size of the compressed
output as compare to our delta encoding approach.
Approach-2
---------------
Use only one bit for control data [0 - Length and new data, 1 - pick from
history based on OFFSET-LENGTH]
The modified bit value (0) is to handle the new field data as a continuous
stream of data, instead of treating every byte as a new data.
Attached are the patches
1. wal_update_changes_lz_v4 - to use LZ Approach with memcmp to construct WAL record
2. wal_update_changes_modified_lz_v5 - to use modified LZ Approach as mentioned above as Approach-2
The main Changes as compare to previous patch are as follows:
1. In heap_delta_encode, use LZ encoding instead of Custom encoding.
2. Instead of get_tup_info(), introduced heap_getattr_with_len() macro based on suggestion from Noah.
3. LZ macro's moved from .c to .h, as they need to be used for encoding.
4. Changed the format for function arguments for heap_delta_encode()/heap_delta_decode() based on suggestion from Noah.
Please find the updated patches attached with this mail.
Modifications in these patches apart from the above:
1. Traverse the tuple only once (previously it needed to be traversed 3 times) to check whether a particular attribute matches and to get the offsets used to generate the encoded tuple.
To achieve this I have modified the function heap_tuple_attr_equals() into heap_attr_get_length_and_check_equals(), so that it can also get the length of the tuple attribute,
which can be used to calculate the offset (see the short sketch after this list). A separate function could also be written to achieve the same.
2. Improve the comments in the code.
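To make change (1) easier to review, here is a condensed sketch, extracted from the attached patch and simplified (it is not standalone code), of how heap_delta_encode() now obtains equality and both attribute lengths in a single call, and how the old heap_tuple_attr_equals() check becomes a thin wrapper:

/* Inside the per-attribute loop of heap_delta_encode() */
if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
                                           newtup, &old_attr_len, &new_attr_len))
{
    /* attribute changed: emit a history tag for the matched range,
     * then the new attribute bytes as literal data */
}
/* ... */
/* the returned lengths advance the running offsets, so no extra
 * traversal of the tuples is needed to find the next attribute */
old_off += old_attr_len;
new_off += new_attr_len;

/* The HOT-update check reuses the same helper */
static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
                       HeapTuple tup1, HeapTuple tup2)
{
    int32 tup1_attr_len,
          tup2_attr_len;

    return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
                                                 &tup1_attr_len, &tup2_attr_len);
}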
Performance Data:
1. Please refer to the testcase in the attached file pgbench_250.c.
Refer to the function used to create the random string at the end of this mail.
2. The detailed data and configuration settings can be found in the attached files (pgbench_encode_withlz_ff100 & pgbench_encode_withlz_ff80).
Benchmark results with -F 100:
-Patch- -tps@-c1- -tps@-c2- -tps@-c4- -tps@-c8- -WAL@-c8-
xlogscale 802 1453 2253 2643 13.99 GB
xlogscale+org lz 807 1602 3168 5140 9.50 GB
xlogscale+mod lz 796 1620 3216 5270 9.16 GB
Benchmark results with -F 80:
-Patch- -tps@-c1- -tps@-c2- -tps@-c4- -tps@-c8- -WAL@-c8-
xlogscale 811 1455 2148 2704 13.6 GB
xlogscale+org lz 829 1684 3223 5325 9.13 GB
xlogscale+mod lz 801 1657 3263 5488 8.86 GB
I shall write the wal_update_changes_custom_delta_v6, and then we can compare all the three patches performance data and decide which one to go based on results.
The results with this are not better than the above 2 approaches, so I am not attaching it.
Function used to create random string
--------------------------------------------------------
CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
select upper(
    substring(
        (
            SELECT string_agg(md5(random()::TEXT), '')
            FROM generate_series(1, CEIL($1 / 32.)::integer)
        ),
        $1)
    );
$$;
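For a quick sanity check of the generator (illustrative only; 250 presumably matches the column width exercised by the attached pgbench_250.c testcase), it can be invoked directly:

SELECT random_text_md5_v2(250);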
Suggestions/Comments?
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_lz_v4.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc,
! int32 *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,391 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 512,534 ----
}
}
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ int32 len;
+
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, &len);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1012 ----
}
/*
+ * Check if the specified attribute's value is same in both given tuples.
+ * and outputs the length of the given attribute in both tuples.
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ int32 *tup1_attr_len, int32 *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms encoded data from the old and new tuples, covering the modified
+ * columns, using an algorithm similar to the LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the encoded data using lz algorithm.
+ *
+ * Encode the bitmap [+padding] [+oid] as new data, and loop over all
+ * attributes to find any modifications.
+ *
+ * The unmodified data is encoded as a history tag to the output and the
+ * modified data is encoded as new data to the output.
+ *
+ * If the encoded output data is less than 75% of the original data,
+ * the output is considered successfully encoded and we proceed further.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_off = 0,
+ old_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_pad_len,
+ new_pad_len,
+ old_attr_len,
+ new_attr_len;
+ bool match_not_found = false;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_bitmaplen,
+ new_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ old_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum allowed encoded size is 75% of the total size. The max
+ * tuple size is already validated, as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length calculated for the
+ * corresponding operation.
+ */
+ if ((bp + (2 * new_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_attr_len, &new_attr_len))
+ {
+ match_not_found = true;
+ data_len = old_off - match_off;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length calculated for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple's t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, then
+ * encode it as data to be copied from the history tuple using a
+ * length and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ }
+
+ old_off += old_attr_len;
+ new_off += new_attr_len;
+
+ match_off = old_off;
+ }
+ else
+ {
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length calculated for the
+ * corresponding operation.
+ */
+ data_len = new_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * calculate the old tuple field start position, so that any
+ * alignment padding that is present can be skipped.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_off;
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+
+ old_pad_len = old_off - temp_off;
+
+ /*
+ * calculate the new tuple field start position to check whether
+ * any padding is required because of field alignment.
+ */
+ temp_off = new_off;
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ new_pad_len = new_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_pad_len != new_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and new
+ * tuples and the last attribute value of the new tuple is the
+ * same as in the old tuple, then encode the data up to the
+ * current match as history data.
+ *
+ * If the last attribute value of the new tuple is not the same
+ * as in the old tuple, then marking the matched data as history
+ * has already been taken care of.
+ */
+ if (!match_not_found)
+ {
+ /*
+ * Check whether the output buffer would reach result_max
+ * after advancing it by the approximate length calculated
+ * for the corresponding operation.
+ */
+ data_len = old_off - old_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+ }
+
+ match_off = old_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_pad_len, dp);
+ }
+ }
+
+ old_off += old_attr_len;
+ new_off += new_attr_len;
+
+ change_off = new_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which is
+ * used while copying the modified data.
+ */
+ dp = dstart + new_off;
+ match_not_found = false;
+ }
+ }
+
+ /* If any modified column data is present then copy it. */
+ data_len = new_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data is present then copy it as history */
+ data_len = old_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *)encdata,
+ (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *)oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,851 **** heapgettup_pagemode(HeapScanDesc scan,
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull)
{
return (
(attnum) > 0 ?
--- 845,852 ----
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull, int32 *len)
{
return (
(attnum) > 0 ?
***************
*** 855,877 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
(Datum) NULL
)
:
(
! nocachegetattr((tup), (attnum), (tupleDesc))
)
)
)
--- 856,882 ----
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
! (char *)(tup)->t_data + (tup)->t_data->t_hoff +
! (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
+ (*(len) = 0),
(Datum) NULL
)
:
(
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), len)
)
)
)
***************
*** 881,886 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
--- 886,903 ----
)
);
}
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
+ fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull)
+ {
+ int32 len;
+
+ return fastgetattr_with_len(tup, attnum, tupleDesc, isnull, &len);
+ }
#endif /* defined(DISABLE_COMPLEX_MACRO) */
***************
*** 3200,3209 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3217,3228 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3270,3343 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3289,3299 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! int32 tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4435,4441 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4391,4397 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4444,4449 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4400,4416 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4453,4463 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4420,4460 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can hold a history offset only in the range of
+ * 1 - 4095, so delta encoding is not applied to tuples whose length
+ * is PGLZ_HISTORY_SIZE or more.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *)&buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4484,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4481,4495 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * -----------
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
! * OR
! * PG93FORMAT [If encoded]: LZ header + Encoded data
! * -----------
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5262,5268 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5265,5274 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5277,5283 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5283,5289 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5337,5343 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5343,5349 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5356,5362 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5362,5368 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5381,5387 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5387,5393 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5444,5453 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5450,5478 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *)(((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5462,5468 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5487,5493 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest, resolving backward references against
+ * the supplied history buffer if one is provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* Copy the header locally to avoid unaligned access to PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,714 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
--- 658,685 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! page's all-visible bit was cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! page's all-visible bit was cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the new
! tuple data is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr, but also returns the length of the given
+ * attribute.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr, but also returns the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, int32 *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ int32 *tup1_attr_len, int32 *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,188 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * Calculate the approximate output length required for the history-reference
+ * tag(s) covering the given length.
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * If the given length exceeds the maximum match length, the reference is
+ * split into multiple tags, repeating until the whole length is processed.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs literal bytes to the destination buffer, including the
+ * appropriate control bits, until the given input length is consumed.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _total_len = (_len); \
+ while (_total_len-- > 0) \
+ { \
+ pglz_out_literal(_ctrlp, _ctrlb, _ctrl, _buf, *(_byte)); \
+ (_byte) = (char *)(_byte) + 1; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 205,210 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
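
For reviewers skimming the tag format above: below is a minimal standalone sketch (illustrative only, not part of the patch; the function name is invented) of the 2/3-byte backward/history reference tag that one iteration of pglz_out_tag emits, assuming the usual 12-bit history offset (1 - 4095) and PGLZ_MAX_MATCH (273) limits. The corresponding control bit must be set separately by the caller.

#include <assert.h>

/*
 * Illustrative sketch: emit a single backward/history reference tag for a
 * match of "len" bytes at history offset "off".  Returns the number of tag
 * bytes written (2 or 3).
 */
int
pglz_emit_one_tag_sketch(unsigned char *buf, int len, int off)
{
	assert(len >= 3 && len <= 273);		/* at most PGLZ_MAX_MATCH per tag */
	assert(off >= 1 && off <= 4095);	/* 12-bit history offset */

	if (len > 17)
	{
		/* long form: high offset nibble | 0x0f, low offset byte, len - 18 */
		buf[0] = (unsigned char) (((off & 0xf00) >> 4) | 0x0f);
		buf[1] = (unsigned char) (off & 0xff);
		buf[2] = (unsigned char) (len - 18);
		return 3;
	}

	/* short form: high offset nibble | (len - 3), low offset byte */
	buf[0] = (unsigned char) (((off & 0xf00) >> 4) | (len - 3));
	buf[1] = (unsigned char) (off & 0xff);
	return 2;
}

In the modified-LZ approach a set control bit selects such a tag, while a clear bit introduces a length byte followed by that many bytes of new data, which is what pglz_decompress_with_history consumes when a history buffer is supplied.
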
wal_update_changes_mod_lz_v5.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
int attnum,
! TupleDesc tupleDesc,
! int32 *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,391 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 512,534 ----
}
}
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ int32 len;
+
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, &len);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1012 ----
}
/*
+ * Check whether the specified attribute's value is the same in both given
+ * tuples, and output the length of the attribute in each tuple.
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ int32 *tup1_attr_len, int32 *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms delta-encoded data from the old and new tuples, covering the
+ * modified columns, using an LZ-style algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the output buffer for the LZ-encoded data.
+ *
+ * The bitmap [+ padding] [+ oid] is encoded as new data, and then all
+ * attributes are scanned to find any modifications.
+ *
+ * Unmodified data is encoded as history tags in the output, while modified
+ * data is encoded as new data.
+ *
+ * The output is considered successfully encoded, and used, only if it is
+ * less than 75% of the original data.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_off = 0,
+ old_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_pad_len,
+ new_pad_len,
+ old_attr_len,
+ new_attr_len;
+ bool match_not_found = false;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_bitmaplen,
+ new_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ old_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The encoded data may be at most 75% of the total size. The max tuple
+ * size has already been validated to be no more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Give up if the output buffer would reach result_max after advancing
+ * it by the approximate length calculated for this operation.
+ */
+ if ((bp + 2 + new_bitmaplen) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record; otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_attr_len, &new_attr_len))
+ {
+ match_not_found = true;
+ data_len = old_off - match_off;
+
+ /*
+ * Give up if the output buffer would reach result_max after advancing
+ * it by the approximate length calculated for this operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated relative to the tuple's t_hoff,
+ * so the bitmap length needs to be added to match_off to get the
+ * actual start offset in the old/history tuple.
+ */
+ match_off += old_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, encode
+ * it as a copy from the history tuple, using a length and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ }
+
+ old_off += old_attr_len;
+ new_off += new_attr_len;
+
+ match_off = old_off;
+ }
+ else
+ {
+ /*
+ * Give up if the output buffer would reach result_max after advancing
+ * it by the approximate length calculated for this operation.
+ */
+ data_len = new_off - change_off;
+ if ((bp + data_len + 2) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the old tuple field start position, needed to skip
+ * any alignment padding that is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_off;
+ old_off = att_align_pointer(old_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)oldtup->t_data + oldtup->t_data->t_hoff + old_off);
+
+ old_pad_len = old_off - temp_off;
+
+ /*
+ * Calculate the new tuple field start position to check whether
+ * any padding is required because of field alignment.
+ */
+ temp_off = new_off;
+ new_off = att_align_pointer(new_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *)newtup->t_data + newtup->t_data->t_hoff + new_off);
+ new_pad_len = new_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_pad_len != new_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and new
+ * tuples, and the last attribute value of the new tuple is the
+ * same as in the old tuple, write the data up to the current
+ * match as history.
+ *
+ * If the last attribute value of the new tuple is not the same
+ * as in the old tuple, the matched data has already been
+ * emitted as history.
+ */
+ if (!match_not_found)
+ {
+ /*
+ * Give up if the output buffer would reach result_max after
+ * advancing it by the approximate length calculated for this
+ * operation.
+ */
+ data_len = old_off - old_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+ }
+
+ match_off = old_off;
+
+ /* Alignment data */
+ if ((bp + new_pad_len + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_pad_len, dp);
+ }
+ }
+
+ old_off += old_attr_len;
+ new_off += new_attr_len;
+
+ change_off = new_off;
+
+ /*
+ * Recalculate the destination pointer from the new offset; it is
+ * used while copying the modified data.
+ */
+ dp = dstart + new_off;
+ match_not_found = false;
+ }
+ }
+
+ /* If any modified column data remains, copy it. */
+ data_len = new_off - change_off;
+ if ((bp + data_len + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data remains, encode it as history */
+ data_len = old_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size limit (result_max).
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data into the destination tuple, using the old tuple as history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ pglz_decompress_with_history((char *)encdata,
+ (char *)newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *)oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,851 **** heapgettup_pagemode(HeapScanDesc scan,
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull)
{
return (
(attnum) > 0 ?
--- 845,852 ----
* definition in access/htup.h is maintained.
*/
Datum
! fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
! bool *isnull, int32 *len)
{
return (
(attnum) > 0 ?
***************
*** 855,877 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
(Datum) NULL
)
:
(
! nocachegetattr((tup), (attnum), (tupleDesc))
)
)
)
--- 856,882 ----
(
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
(
! (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
! (char *)(tup)->t_data + (tup)->t_data->t_hoff +
! (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
! fetchatt((tupleDesc)->attrs[(attnum) - 1],
(char *) (tup)->t_data + (tup)->t_data->t_hoff +
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
)
:
(
att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
(
(*(isnull) = true),
+ (*(len) = 0),
(Datum) NULL
)
:
(
! nocachegetattr_with_len((tup), (attnum), (tupleDesc), len)
)
)
)
***************
*** 881,886 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
--- 886,903 ----
)
);
}
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
+ fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull)
+ {
+ int32 len;
+
+ return fastgetattr_with_len(tup, attnum, tupleDesc, isnull, &len);
+ }
#endif /* defined(DISABLE_COMPLEX_MACRO) */
***************
*** 3200,3209 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3217,3228 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3270,3343 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3289,3299 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! int32 tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4435,4441 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4391,4397 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4444,4449 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4400,4416 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure that holds the maximum possible output of the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4453,4463 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4420,4460 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can only hold history offsets in the range 1 - 4095,
+ * so delta encoding is attempted only for tuples shorter than
+ * PGLZ_HISTORY_SIZE.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *)&buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4484,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4481,4495 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * -----------
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
! * OR
! * PG93FORMAT [If encoded]: LZ header + Encoded data
! * -----------
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5262,5268 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5265,5274 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5277,5283 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5283,5289 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5337,5343 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5343,5349 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5356,5362 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5362,5368 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5381,5387 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5387,5393 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5444,5453 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5450,5478 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *)(((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5462,5468 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5487,5493 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest, resolving backward references against
+ * the supplied history buffer if one is provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* Copy the header locally to avoid unaligned access to PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,735 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * An unset control bit means a run of new data when decoding
! * with history: the first byte holds the length of the run and
! * the following bytes are the literal data to copy from the
! * input.
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! /*
! * Check for output buffer overrun, to ensure we don't clobber
! * memory in case of corrupt input. Note: we must advance dp
! * here to ensure the error is detected below the loop. We
! * don't simply put the elog inside the loop since that will
! * probably interfere with optimization.
! */
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the length byte from the
! * INPUT to OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! page's all-visible bit was cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! page's all-visible bit was cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the new
! tuple data is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr, but also returns the length of the given
+ * attribute.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Same as heap_getattr, but additionally returns the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, int32 *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ int32 *tup1_attr_len, int32 *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,196 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * Calculate the approximate number of bytes needed to emit the history
+ * reference tag(s) covering the given length.
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * Splits the backward/history reference into multiple chunks when the given
+ * length exceeds the maximum match length (PGLZ_MAX_MATCH), repeating until
+ * the whole length has been emitted.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a 1-byte length tag followed by the new data to the destination
+ * buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _mlen; \
+ int32 _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > 255 ? 255 : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_mlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _mlen); \
+ (_buf) += _mlen; \
+ (_byte) += _mlen; \
+ _total_len -= _mlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 213,218 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
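To make the encode side of the above easier to follow, here is a rough sketch (illustrative only, not part of the attached patch) of how heap_delta_encode is expected to drive the pglz_out_tag/pglz_out_add macros: columns that are unchanged between the old and new tuple become OFFSET/LENGTH references into the old tuple (the "history"), and changed columns are emitted as new data. The function name, its arguments and the pre-computed per-column offsets/lengths are assumptions for illustration; NULL bitmap handling, alignment and buffer-overrun checks are omitted.

/*
 * Illustrative sketch -- assumes "postgres.h" and "utils/pg_lzcompress.h"
 * are included and that the caller has already computed, per column, its
 * offset/length in the old and new tuple data and whether it is unchanged.
 * Offsets must fit the 12-bit tag format (PGLZ_HISTORY_SIZE = 4096), and
 * unchanged columns are assumed to be at least 3 bytes long (shorter ones
 * would have to be emitted as new data in a real implementation).
 */
static void
delta_encode_sketch(const char *newdata, int32 newlen, int natts,
					const int32 *old_att_off, const int32 *new_att_off,
					const int32 *att_len, const bool *att_unchanged,
					PGLZ_Header *dest)
{
	unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
	unsigned char *bstart = bp;
	unsigned char ctrl_dummy = 0;
	unsigned char *ctrlp = &ctrl_dummy;
	unsigned char ctrlb = 0;
	unsigned char ctrl = 0;
	int			i;

	for (i = 0; i < natts; i++)
	{
		if (att_unchanged[i])
		{
			/* unchanged column: reference the same bytes in the old tuple */
			int32		off = old_att_off[i];	/* advanced by the macro */

			pglz_out_tag(ctrlp, ctrlb, ctrl, bp, att_len[i], off);
		}
		else
		{
			/* changed column: emit length byte(s) plus the new bytes */
			const char *dp = newdata + new_att_off[i];	/* advanced by the macro */

			pglz_out_add(ctrlp, ctrlb, ctrl, bp, att_len[i], dp);
		}
	}

	*ctrlp = ctrlb;				/* flush the final control byte */
	SET_VARSIZE_COMPRESSED(dest, bp - bstart + sizeof(PGLZ_Header));
	dest->rawsize = newlen;
}

On the decode side, pglz_decompress_with_history() (declared above) would then rebuild the new tuple by copying the referenced ranges out of the old tuple and the new/literal bytes out of the WAL record.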
pgbench_250.c text/plain; name=pgbench_250.c
/*
* pgbench.c
*
* A simple benchmark program for PostgreSQL
* Originally written by Tatsuo Ishii and enhanced by many contributors.
*
* contrib/pgbench/pgbench.c
* Copyright (c) 2000-2012, PostgreSQL Global Development Group
* ALL RIGHTS RESERVED;
*
* Permission to use, copy, modify, and distribute this software and its
* documentation for any purpose, without fee, and without a written agreement
* is hereby granted, provided that the above copyright notice and this
* paragraph and the following two paragraphs appear in all copies.
*
* IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
* DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
* LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
* DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*
* THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIMS ANY WARRANTIES,
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
* AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
* ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
* PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
*
*/
#ifdef WIN32
#define FD_SETSIZE 1024 /* set before winsock2.h is included */
#endif /* ! WIN32 */
#include "postgres_fe.h"
#include "getopt_long.h"
#include "libpq-fe.h"
#include "libpq/pqsignal.h"
#include "portability/instr_time.h"
#include <ctype.h>
#ifndef WIN32
#include <sys/time.h>
#include <unistd.h>
#endif /* ! WIN32 */
#ifdef HAVE_SYS_SELECT_H
#include <sys/select.h>
#endif
#ifdef HAVE_SYS_RESOURCE_H
#include <sys/resource.h> /* for getrlimit */
#endif
#ifndef INT64_MAX
#define INT64_MAX INT64CONST(0x7FFFFFFFFFFFFFFF)
#endif
/*
* Multi-platform pthread implementations
*/
#ifdef WIN32
/* Use native win32 threads on Windows */
typedef struct win32_pthread *pthread_t;
typedef int pthread_attr_t;
static int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int pthread_join(pthread_t th, void **thread_return);
#elif defined(ENABLE_THREAD_SAFETY)
/* Use platform-dependent pthread capability */
#include <pthread.h>
#else
/* Use emulation with fork. Rename pthread identifiers to avoid conflicts */
#include <sys/wait.h>
#define pthread_t pg_pthread_t
#define pthread_attr_t pg_pthread_attr_t
#define pthread_create pg_pthread_create
#define pthread_join pg_pthread_join
typedef struct fork_pthread *pthread_t;
typedef int pthread_attr_t;
static int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
static int pthread_join(pthread_t th, void **thread_return);
#endif
extern char *optarg;
extern int optind;
/********************************************************************
* some configurable parameters */
/* max number of clients allowed */
#ifdef FD_SETSIZE
#define MAXCLIENTS (FD_SETSIZE - 10)
#else
#define MAXCLIENTS 1024
#endif
#define DEFAULT_NXACTS 10 /* default nxacts */
int nxacts = 0; /* number of transactions per client */
int duration = 0; /* duration in seconds */
/*
* scaling factor. for example, scale = 10 will make 1000000 tuples in
* pgbench_accounts table.
*/
int scale = 1;
/*
* fillfactor. for example, fillfactor = 90 will use only 90 percent
* space during inserts and leave 10 percent free.
*/
int fillfactor = 100;
/*
* create foreign key constraints on the tables?
*/
int foreign_keys = 0;
/*
* use unlogged tables?
*/
int unlogged_tables = 0;
/*
* tablespace selection
*/
char *tablespace = NULL;
char *index_tablespace = NULL;
/*
* end of configurable parameters
*********************************************************************/
#define nbranches 1 /* Makes little sense to change this. Change
* -s instead */
#define ntellers 10
#define naccounts 100000
bool use_log; /* log transaction latencies to a file */
bool is_connect; /* establish connection for each transaction */
bool is_latencies; /* report per-command latencies */
int main_pid; /* main process id used in log filename */
char *pghost = "";
char *pgport = "";
char *login = NULL;
char *dbName;
const char *progname;
volatile bool timer_exceeded = false; /* flag from signal handler */
/* variable definitions */
typedef struct
{
char *name; /* variable name */
char *value; /* its value */
} Variable;
#define MAX_FILES 128 /* max number of SQL script files allowed */
#define SHELL_COMMAND_SIZE 256 /* maximum size allowed for shell command */
/*
* structures used in custom query mode
*/
typedef struct
{
PGconn *con; /* connection handle to DB */
int id; /* client No. */
int state; /* state No. */
int cnt; /* xacts count */
int ecnt; /* error count */
int listen; /* 0 indicates that an async query has been
* sent */
int sleeping; /* 1 indicates that the client is napping */
int64 until; /* napping until (usec) */
Variable *variables; /* array of variable definitions */
int nvariables;
instr_time txn_begin; /* used for measuring transaction latencies */
instr_time stmt_begin; /* used for measuring statement latencies */
int use_file; /* index in sql_files for this client */
bool prepared[MAX_FILES];
} CState;
/*
* Thread state and result
*/
typedef struct
{
int tid; /* thread id */
pthread_t thread; /* thread handle */
CState *state; /* array of CState */
int nstate; /* length of state[] */
instr_time start_time; /* thread start time */
instr_time *exec_elapsed; /* time spent executing cmds (per Command) */
int *exec_count; /* number of cmd executions (per Command) */
unsigned short random_state[3]; /* separate randomness for each thread */
} TState;
#define INVALID_THREAD ((pthread_t) 0)
typedef struct
{
instr_time conn_time;
int xacts;
} TResult;
/*
* queries read from files
*/
#define SQL_COMMAND 1
#define META_COMMAND 2
#define MAX_ARGS 10
typedef enum QueryMode
{
QUERY_SIMPLE, /* simple query */
QUERY_EXTENDED, /* extended query */
QUERY_PREPARED, /* extended query with prepared statements */
NUM_QUERYMODE
} QueryMode;
static QueryMode querymode = QUERY_SIMPLE;
static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
typedef struct
{
char *line; /* full text of command line */
int command_num; /* unique index of this Command struct */
int type; /* command type (SQL_COMMAND or META_COMMAND) */
int argc; /* number of command words */
char *argv[MAX_ARGS]; /* command word list */
} Command;
static Command **sql_files[MAX_FILES]; /* SQL script files */
static int num_files; /* number of script files */
static int num_commands = 0; /* total number of Command structs */
static int debug = 0; /* debug flag */
/* default scenario */
static char *tpc_b = {
"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
"\\setrandom aid 1 :naccounts\n"
"\\setrandom bid 1 :nbranches\n"
"\\setrandom tid 1 :ntellers\n"
"\\setrandom delta -5000 5000\n"
"BEGIN;\n"
"UPDATE pgbench_accounts SET abalance = abalance + :delta,"
"filler = \'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ :delta\',"
"filler1 = random_text_md5_v2(100) WHERE aid = :aid;\n"
"UPDATE pgbench_tellers SET tbalance = tbalance + :delta,"
"filler = \'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ :delta\'"
"filler1 = random_text_md5_v2(100) WHERE tid = :tid;\n"
"UPDATE pgbench_branches SET bbalance = bbalance + :delta,"
"filler = \'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ :delta\'"
"filler1 = random_text_md5_v2(100) WHERE bid = :bid;\n"
"END;\n"
};
/* -N case */
static char *simple_update = {
"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
"\\setrandom aid 1 :naccounts\n"
"\\setrandom bid 1 :nbranches\n"
"\\setrandom tid 1 :ntellers\n"
"\\setrandom delta -5000 5000\n"
"BEGIN;\n"
"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
"END;\n"
};
/* -S case */
static char *select_only = {
"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
"\\setrandom aid 1 :naccounts\n"
"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
};
/* Function prototypes */
static void setalarm(int seconds);
static void *threadRun(void *arg);
/*
* routines to check mem allocations and fail noisily.
*/
static void *
xmalloc(size_t size)
{
void *result;
result = malloc(size);
if (!result)
{
fprintf(stderr, "out of memory\n");
exit(1);
}
return result;
}
static void *
xrealloc(void *ptr, size_t size)
{
void *result;
result = realloc(ptr, size);
if (!result)
{
fprintf(stderr, "out of memory\n");
exit(1);
}
return result;
}
static char *
xstrdup(const char *s)
{
char *result;
result = strdup(s);
if (!result)
{
fprintf(stderr, "out of memory\n");
exit(1);
}
return result;
}
static void
usage(void)
{
printf("%s is a benchmarking tool for PostgreSQL.\n\n"
"Usage:\n"
" %s [OPTION]... [DBNAME]\n"
"\nInitialization options:\n"
" -i invokes initialization mode\n"
" -n do not run VACUUM after initialization\n"
" -F NUM fill factor\n"
" -s NUM scaling factor\n"
" --foreign-keys\n"
" create foreign key constraints between tables\n"
" --index-tablespace=TABLESPACE\n"
" create indexes in the specified tablespace\n"
" --tablespace=TABLESPACE\n"
" create tables in the specified tablespace\n"
" --unlogged-tables\n"
" create tables as unlogged tables\n"
"\nBenchmarking options:\n"
" -c NUM number of concurrent database clients (default: 1)\n"
" -C establish new connection for each transaction\n"
" -D VARNAME=VALUE\n"
" define variable for use by custom script\n"
" -f FILENAME read transaction script from FILENAME\n"
" -j NUM number of threads (default: 1)\n"
" -l write transaction times to log file\n"
" -M simple|extended|prepared\n"
" protocol for submitting queries to server (default: simple)\n"
" -n do not run VACUUM before tests\n"
" -N do not update tables \"pgbench_tellers\" and \"pgbench_branches\"\n"
" -r report average latency per command\n"
" -s NUM report this scale factor in output\n"
" -S perform SELECT-only transactions\n"
" -t NUM number of transactions each client runs (default: 10)\n"
" -T NUM duration of benchmark test in seconds\n"
" -v vacuum all four standard tables before tests\n"
"\nCommon options:\n"
" -d print debugging output\n"
" -h HOSTNAME database server host or socket directory\n"
" -p PORT database server port number\n"
" -U USERNAME connect as specified database user\n"
" -V, --version output version information, then exit\n"
" -?, --help show this help, then exit\n"
"\n"
"Report bugs to <pgsql-bugs@postgresql.org>.\n",
progname, progname);
}
/* random number generator: uniform distribution from min to max inclusive */
static int
getrand(TState *thread, int min, int max)
{
/*
* Odd coding is so that min and max have approximately the same chance of
* being selected as do numbers between them.
*
* pg_erand48() is thread-safe and concurrent, which is why we use it
* rather than random(), which in glibc is non-reentrant, and therefore
* protected by a mutex, and therefore a bottleneck on machines with many
* CPUs.
*/
return min + (int) ((max - min + 1) * pg_erand48(thread->random_state));
}
/* call PQexec() and exit() on failure */
static void
executeStatement(PGconn *con, const char *sql)
{
PGresult *res;
res = PQexec(con, sql);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
{
fprintf(stderr, "%s", PQerrorMessage(con));
exit(1);
}
PQclear(res);
}
/* set up a connection to the backend */
static PGconn *
doConnect(void)
{
PGconn *conn;
static char *password = NULL;
bool new_pass;
/*
* Start the connection. Loop until we have a password if requested by
* backend.
*/
do
{
#define PARAMS_ARRAY_SIZE 7
const char *keywords[PARAMS_ARRAY_SIZE];
const char *values[PARAMS_ARRAY_SIZE];
keywords[0] = "host";
values[0] = pghost;
keywords[1] = "port";
values[1] = pgport;
keywords[2] = "user";
values[2] = login;
keywords[3] = "password";
values[3] = password;
keywords[4] = "dbname";
values[4] = dbName;
keywords[5] = "fallback_application_name";
values[5] = progname;
keywords[6] = NULL;
values[6] = NULL;
new_pass = false;
conn = PQconnectdbParams(keywords, values, true);
if (!conn)
{
fprintf(stderr, "Connection to database \"%s\" failed\n",
dbName);
return NULL;
}
if (PQstatus(conn) == CONNECTION_BAD &&
PQconnectionNeedsPassword(conn) &&
password == NULL)
{
PQfinish(conn);
password = simple_prompt("Password: ", 100, false);
new_pass = true;
}
} while (new_pass);
/* check to see that the backend connection was successfully made */
if (PQstatus(conn) == CONNECTION_BAD)
{
fprintf(stderr, "Connection to database \"%s\" failed:\n%s",
dbName, PQerrorMessage(conn));
PQfinish(conn);
return NULL;
}
return conn;
}
/* throw away response from backend */
static void
discard_response(CState *state)
{
PGresult *res;
do
{
res = PQgetResult(state->con);
if (res)
PQclear(res);
} while (res);
}
static int
compareVariables(const void *v1, const void *v2)
{
return strcmp(((const Variable *) v1)->name,
((const Variable *) v2)->name);
}
static char *
getVariable(CState *st, char *name)
{
Variable key,
*var;
/* On some versions of Solaris, bsearch of zero items dumps core */
if (st->nvariables <= 0)
return NULL;
key.name = name;
var = (Variable *) bsearch((void *) &key,
(void *) st->variables,
st->nvariables,
sizeof(Variable),
compareVariables);
if (var != NULL)
return var->value;
else
return NULL;
}
/* check whether the name consists only of letters, digits and underscores. */
static bool
isLegalVariableName(const char *name)
{
int i;
for (i = 0; name[i] != '\0'; i++)
{
if (!isalnum((unsigned char) name[i]) && name[i] != '_')
return false;
}
return true;
}
static int
putVariable(CState *st, const char *context, char *name, char *value)
{
Variable key,
*var;
key.name = name;
/* On some versions of Solaris, bsearch of zero items dumps core */
if (st->nvariables > 0)
var = (Variable *) bsearch((void *) &key,
(void *) st->variables,
st->nvariables,
sizeof(Variable),
compareVariables);
else
var = NULL;
if (var == NULL)
{
Variable *newvars;
/*
* Check for the name only when declaring a new variable to avoid
* overhead.
*/
if (!isLegalVariableName(name))
{
fprintf(stderr, "%s: invalid variable name '%s'\n", context, name);
return false;
}
if (st->variables)
newvars = (Variable *) xrealloc(st->variables,
(st->nvariables + 1) * sizeof(Variable));
else
newvars = (Variable *) xmalloc(sizeof(Variable));
st->variables = newvars;
var = &newvars[st->nvariables];
var->name = xstrdup(name);
var->value = xstrdup(value);
st->nvariables++;
qsort((void *) st->variables, st->nvariables, sizeof(Variable),
compareVariables);
}
else
{
char *val;
/* dup then free, in case value is pointing at this variable */
val = xstrdup(value);
free(var->value);
var->value = val;
}
return true;
}
static char *
parseVariable(const char *sql, int *eaten)
{
int i = 0;
char *name;
do
{
i++;
} while (isalnum((unsigned char) sql[i]) || sql[i] == '_');
if (i == 1)
return NULL;
name = xmalloc(i);
memcpy(name, &sql[1], i - 1);
name[i - 1] = '\0';
*eaten = i;
return name;
}
static char *
replaceVariable(char **sql, char *param, int len, char *value)
{
int valueln = strlen(value);
if (valueln > len)
{
size_t offset = param - *sql;
*sql = xrealloc(*sql, strlen(*sql) - len + valueln + 1);
param = *sql + offset;
}
if (valueln != len)
memmove(param + valueln, param + len, strlen(param + len) + 1);
strncpy(param, value, valueln);
return param + valueln;
}
static char *
assignVariables(CState *st, char *sql)
{
char *p,
*name,
*val;
p = sql;
while ((p = strchr(p, ':')) != NULL)
{
int eaten;
name = parseVariable(p, &eaten);
if (name == NULL)
{
while (*p == ':')
{
p++;
}
continue;
}
val = getVariable(st, name);
free(name);
if (val == NULL)
{
p++;
continue;
}
p = replaceVariable(&sql, p, eaten, val);
}
return sql;
}
static void
getQueryParams(CState *st, const Command *command, const char **params)
{
int i;
for (i = 0; i < command->argc - 1; i++)
params[i] = getVariable(st, command->argv[i + 1]);
}
/*
* Run a shell command. The result is assigned to the variable if not NULL.
* Return true if succeeded, or false on error.
*/
static bool
runShellCommand(CState *st, char *variable, char **argv, int argc)
{
char command[SHELL_COMMAND_SIZE];
int i,
len = 0;
FILE *fp;
char res[64];
char *endptr;
int retval;
/*----------
* Join arguments with whitespace separators. Arguments starting with
* exactly one colon are treated as variables:
* name - append a string "name"
* :var - append a variable named 'var'
* ::name - append a string ":name"
*----------
*/
for (i = 0; i < argc; i++)
{
char *arg;
int arglen;
if (argv[i][0] != ':')
{
arg = argv[i]; /* a string literal */
}
else if (argv[i][1] == ':')
{
arg = argv[i] + 1; /* a string literal starting with colons */
}
else if ((arg = getVariable(st, argv[i] + 1)) == NULL)
{
fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[i]);
return false;
}
arglen = strlen(arg);
if (len + arglen + (i > 0 ? 1 : 0) >= SHELL_COMMAND_SIZE - 1)
{
fprintf(stderr, "%s: too long shell command\n", argv[0]);
return false;
}
if (i > 0)
command[len++] = ' ';
memcpy(command + len, arg, arglen);
len += arglen;
}
command[len] = '\0';
/* Fast path for non-assignment case */
if (variable == NULL)
{
if (system(command))
{
if (!timer_exceeded)
fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
return false;
}
return true;
}
/* Execute the command with pipe and read the standard output. */
if ((fp = popen(command, "r")) == NULL)
{
fprintf(stderr, "%s: cannot launch shell command\n", argv[0]);
return false;
}
if (fgets(res, sizeof(res), fp) == NULL)
{
if (!timer_exceeded)
fprintf(stderr, "%s: cannot read the result\n", argv[0]);
return false;
}
if (pclose(fp) < 0)
{
fprintf(stderr, "%s: cannot close shell command\n", argv[0]);
return false;
}
/* Check whether the result is an integer and assign it to the variable */
retval = (int) strtol(res, &endptr, 10);
while (*endptr != '\0' && isspace((unsigned char) *endptr))
endptr++;
if (*res == '\0' || *endptr != '\0')
{
fprintf(stderr, "%s: must return an integer ('%s' returned)\n", argv[0], res);
return false;
}
snprintf(res, sizeof(res), "%d", retval);
if (!putVariable(st, "setshell", variable, res))
return false;
#ifdef DEBUG
printf("shell parameter name: %s, value: %s\n", argv[1], res);
#endif
return true;
}
#define MAX_PREPARE_NAME 32
static void
preparedStatementName(char *buffer, int file, int state)
{
sprintf(buffer, "P%d_%d", file, state);
}
static bool
clientDone(CState *st, bool ok)
{
(void) ok; /* unused */
if (st->con != NULL)
{
PQfinish(st->con);
st->con = NULL;
}
return false; /* always false */
}
/* return false iff client should be disconnected */
static bool
doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile)
{
PGresult *res;
Command **commands;
top:
commands = sql_files[st->use_file];
if (st->sleeping)
{ /* are we sleeping? */
instr_time now;
INSTR_TIME_SET_CURRENT(now);
if (st->until <= INSTR_TIME_GET_MICROSEC(now))
st->sleeping = 0; /* Done sleeping, go ahead with next command */
else
return true; /* Still sleeping, nothing to do here */
}
if (st->listen)
{ /* are we receiver? */
if (commands[st->state]->type == SQL_COMMAND)
{
if (debug)
fprintf(stderr, "client %d receiving\n", st->id);
if (!PQconsumeInput(st->con))
{ /* there's something wrong */
fprintf(stderr, "Client %d aborted in state %d. Probably the backend died while processing.\n", st->id, st->state);
return clientDone(st, false);
}
if (PQisBusy(st->con))
return true; /* don't have the whole result yet */
}
/*
* command finished: accumulate per-command execution times in
* thread-local data structure, if per-command latencies are requested
*/
if (is_latencies)
{
instr_time now;
int cnum = commands[st->state]->command_num;
INSTR_TIME_SET_CURRENT(now);
INSTR_TIME_ACCUM_DIFF(thread->exec_elapsed[cnum],
now, st->stmt_begin);
thread->exec_count[cnum]++;
}
/*
* if transaction finished, record the time it took in the log
*/
if (logfile && commands[st->state + 1] == NULL)
{
instr_time now;
instr_time diff;
double usec;
INSTR_TIME_SET_CURRENT(now);
diff = now;
INSTR_TIME_SUBTRACT(diff, st->txn_begin);
usec = (double) INSTR_TIME_GET_MICROSEC(diff);
#ifndef WIN32
/* This is more than we really ought to know about instr_time */
fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
st->id, st->cnt, usec, st->use_file,
(long) now.tv_sec, (long) now.tv_usec);
#else
/* On Windows, instr_time doesn't provide a timestamp anyway */
fprintf(logfile, "%d %d %.0f %d 0 0\n",
st->id, st->cnt, usec, st->use_file);
#endif
}
if (commands[st->state]->type == SQL_COMMAND)
{
/*
* Read and discard the query result; note this is not included in
* the statement latency numbers.
*/
res = PQgetResult(st->con);
switch (PQresultStatus(res))
{
case PGRES_COMMAND_OK:
case PGRES_TUPLES_OK:
break; /* OK */
default:
fprintf(stderr, "Client %d aborted in state %d: %s",
st->id, st->state, PQerrorMessage(st->con));
PQclear(res);
return clientDone(st, false);
}
PQclear(res);
discard_response(st);
}
if (commands[st->state + 1] == NULL)
{
if (is_connect)
{
PQfinish(st->con);
st->con = NULL;
}
++st->cnt;
if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
return clientDone(st, true); /* exit success */
}
/* increment state counter */
st->state++;
if (commands[st->state] == NULL)
{
st->state = 0;
st->use_file = getrand(thread, 0, num_files - 1);
commands = sql_files[st->use_file];
}
}
if (st->con == NULL)
{
instr_time start,
end;
INSTR_TIME_SET_CURRENT(start);
if ((st->con = doConnect()) == NULL)
{
fprintf(stderr, "Client %d aborted in establishing connection.\n", st->id);
return clientDone(st, false);
}
INSTR_TIME_SET_CURRENT(end);
INSTR_TIME_ACCUM_DIFF(*conn_time, end, start);
}
/* Record transaction start time if logging is enabled */
if (logfile && st->state == 0)
INSTR_TIME_SET_CURRENT(st->txn_begin);
/* Record statement start time if per-command latencies are requested */
if (is_latencies)
INSTR_TIME_SET_CURRENT(st->stmt_begin);
if (commands[st->state]->type == SQL_COMMAND)
{
const Command *command = commands[st->state];
int r;
if (querymode == QUERY_SIMPLE)
{
char *sql;
sql = xstrdup(command->argv[0]);
sql = assignVariables(st, sql);
if (debug)
fprintf(stderr, "client %d sending %s\n", st->id, sql);
r = PQsendQuery(st->con, sql);
free(sql);
}
else if (querymode == QUERY_EXTENDED)
{
const char *sql = command->argv[0];
const char *params[MAX_ARGS];
getQueryParams(st, command, params);
if (debug)
fprintf(stderr, "client %d sending %s\n", st->id, sql);
r = PQsendQueryParams(st->con, sql, command->argc - 1,
NULL, params, NULL, NULL, 0);
}
else if (querymode == QUERY_PREPARED)
{
char name[MAX_PREPARE_NAME];
const char *params[MAX_ARGS];
if (!st->prepared[st->use_file])
{
int j;
for (j = 0; commands[j] != NULL; j++)
{
PGresult *res;
char name[MAX_PREPARE_NAME];
if (commands[j]->type != SQL_COMMAND)
continue;
preparedStatementName(name, st->use_file, j);
res = PQprepare(st->con, name,
commands[j]->argv[0], commands[j]->argc - 1, NULL);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
fprintf(stderr, "%s", PQerrorMessage(st->con));
PQclear(res);
}
st->prepared[st->use_file] = true;
}
getQueryParams(st, command, params);
preparedStatementName(name, st->use_file, st->state);
if (debug)
fprintf(stderr, "client %d sending %s\n", st->id, name);
r = PQsendQueryPrepared(st->con, name, command->argc - 1,
params, NULL, NULL, 0);
}
else /* unknown sql mode */
r = 0;
if (r == 0)
{
if (debug)
fprintf(stderr, "client %d cannot send %s\n", st->id, command->argv[0]);
st->ecnt++;
}
else
st->listen = 1; /* flags that should be listened */
}
else if (commands[st->state]->type == META_COMMAND)
{
int argc = commands[st->state]->argc,
i;
char **argv = commands[st->state]->argv;
if (debug)
{
fprintf(stderr, "client %d executing \\%s", st->id, argv[0]);
for (i = 1; i < argc; i++)
fprintf(stderr, " %s", argv[i]);
fprintf(stderr, "\n");
}
if (pg_strcasecmp(argv[0], "setrandom") == 0)
{
char *var;
int min,
max;
char res[64];
if (*argv[2] == ':')
{
if ((var = getVariable(st, argv[2] + 1)) == NULL)
{
fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
st->ecnt++;
return true;
}
min = atoi(var);
}
else
min = atoi(argv[2]);
#ifdef NOT_USED
if (min < 0)
{
fprintf(stderr, "%s: invalid minimum number %d\n", argv[0], min);
st->ecnt++;
return;
}
#endif
if (*argv[3] == ':')
{
if ((var = getVariable(st, argv[3] + 1)) == NULL)
{
fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[3]);
st->ecnt++;
return true;
}
max = atoi(var);
}
else
max = atoi(argv[3]);
if (max < min)
{
fprintf(stderr, "%s: maximum is less than minimum\n", argv[0]);
st->ecnt++;
return true;
}
/*
* getrand() needs to be able to subtract max from min and add
* one to the result without overflowing. Since we know max > min,
* we can detect overflow just by checking for a negative result.
* But we must check both that the subtraction doesn't overflow,
* and that adding one to the result doesn't overflow either.
*/
if (max - min < 0 || (max - min) + 1 < 0)
{
fprintf(stderr, "%s: range too large\n", argv[0]);
st->ecnt++;
return true;
}
#ifdef DEBUG
printf("min: %d max: %d random: %d\n", min, max, getrand(thread, min, max));
#endif
snprintf(res, sizeof(res), "%d", getrand(thread, min, max));
if (!putVariable(st, argv[0], argv[1], res))
{
st->ecnt++;
return true;
}
st->listen = 1;
}
else if (pg_strcasecmp(argv[0], "set") == 0)
{
char *var;
int ope1,
ope2;
char res[64];
if (*argv[2] == ':')
{
if ((var = getVariable(st, argv[2] + 1)) == NULL)
{
fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[2]);
st->ecnt++;
return true;
}
ope1 = atoi(var);
}
else
ope1 = atoi(argv[2]);
if (argc < 5)
snprintf(res, sizeof(res), "%d", ope1);
else
{
if (*argv[4] == ':')
{
if ((var = getVariable(st, argv[4] + 1)) == NULL)
{
fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[4]);
st->ecnt++;
return true;
}
ope2 = atoi(var);
}
else
ope2 = atoi(argv[4]);
if (strcmp(argv[3], "+") == 0)
snprintf(res, sizeof(res), "%d", ope1 + ope2);
else if (strcmp(argv[3], "-") == 0)
snprintf(res, sizeof(res), "%d", ope1 - ope2);
else if (strcmp(argv[3], "*") == 0)
snprintf(res, sizeof(res), "%d", ope1 * ope2);
else if (strcmp(argv[3], "/") == 0)
{
if (ope2 == 0)
{
fprintf(stderr, "%s: division by zero\n", argv[0]);
st->ecnt++;
return true;
}
snprintf(res, sizeof(res), "%d", ope1 / ope2);
}
else
{
fprintf(stderr, "%s: unsupported operator %s\n", argv[0], argv[3]);
st->ecnt++;
return true;
}
}
if (!putVariable(st, argv[0], argv[1], res))
{
st->ecnt++;
return true;
}
st->listen = 1;
}
else if (pg_strcasecmp(argv[0], "sleep") == 0)
{
char *var;
int usec;
instr_time now;
if (*argv[1] == ':')
{
if ((var = getVariable(st, argv[1] + 1)) == NULL)
{
fprintf(stderr, "%s: undefined variable %s\n", argv[0], argv[1]);
st->ecnt++;
return true;
}
usec = atoi(var);
}
else
usec = atoi(argv[1]);
if (argc > 2)
{
if (pg_strcasecmp(argv[2], "ms") == 0)
usec *= 1000;
else if (pg_strcasecmp(argv[2], "s") == 0)
usec *= 1000000;
}
else
usec *= 1000000;
INSTR_TIME_SET_CURRENT(now);
st->until = INSTR_TIME_GET_MICROSEC(now) + usec;
st->sleeping = 1;
st->listen = 1;
}
else if (pg_strcasecmp(argv[0], "setshell") == 0)
{
bool ret = runShellCommand(st, argv[1], argv + 2, argc - 2);
if (timer_exceeded) /* timeout */
return clientDone(st, true);
else if (!ret) /* on error */
{
st->ecnt++;
return true;
}
else /* succeeded */
st->listen = 1;
}
else if (pg_strcasecmp(argv[0], "shell") == 0)
{
bool ret = runShellCommand(st, NULL, argv + 1, argc - 1);
if (timer_exceeded) /* timeout */
return clientDone(st, true);
else if (!ret) /* on error */
{
st->ecnt++;
return true;
}
else /* succeeded */
st->listen = 1;
}
goto top;
}
return true;
}
/* discard connections */
static void
disconnect_all(CState *state, int length)
{
int i;
for (i = 0; i < length; i++)
{
if (state[i].con)
{
PQfinish(state[i].con);
state[i].con = NULL;
}
}
}
/* create tables and setup data */
static void
init(bool is_no_vacuum)
{
/*
* Note: TPC-B requires at least 100 bytes per row, and the "filler"
* fields in these table declarations were intended to comply with that.
* But because they default to NULLs, they don't actually take any space.
* We could fix that by giving them non-null default values. However, that
* would completely break comparability of pgbench results with prior
* versions. Since pgbench has never pretended to be fully TPC-B
* compliant anyway, we stick with the historical behavior.
*/
struct ddlinfo
{
char *table;
char *cols;
int declare_fillfactor;
};
struct ddlinfo DDLs[] = {
{
"pgbench_history",
"tid int,bid int,aid int,delta int,mtime timestamp,filler char(22)",
0
},
{
"pgbench_tellers",
"tid int not null,bid int,tbalance int,filler char(92),"
"tbalance1 int, filler1 varchar(152),tbalance2 int,filler2 char(1550)",
1
},
{
"pgbench_accounts",
"aid int not null,bid int,abalance int,filler char(92),"
"abalance1 int,filler1 varchar(152),abalance2 int,filler2 char(1550)",
1
},
{
"pgbench_branches",
"bid int not null,bbalance int,filler char(92),bbalance1 int,"
"filler1 varchar(152), bbalance2 int, filler2 char(1550)",
1
}
};
static char *DDLAFTERs[] = {
"alter table pgbench_branches add primary key (bid)",
"alter table pgbench_tellers add primary key (tid)",
"alter table pgbench_accounts add primary key (aid)"
};
static char *DDLKEYs[] = {
"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
"alter table pgbench_accounts add foreign key (bid) references pgbench_branches",
"alter table pgbench_history add foreign key (bid) references pgbench_branches",
"alter table pgbench_history add foreign key (tid) references pgbench_tellers",
"alter table pgbench_history add foreign key (aid) references pgbench_accounts"
};
PGconn *con;
PGresult *res;
char sql[256];
int i;
if ((con = doConnect()) == NULL)
exit(1);
for (i = 0; i < lengthof(DDLs); i++)
{
char opts[256];
char buffer[256];
struct ddlinfo *ddl = &DDLs[i];
/* Remove old table, if it exists. */
snprintf(buffer, 256, "drop table if exists %s", ddl->table);
executeStatement(con, buffer);
/* Construct new create table statement. */
opts[0] = '\0';
if (ddl->declare_fillfactor)
snprintf(opts + strlen(opts), 256 - strlen(opts),
" with (fillfactor=%d)", fillfactor);
if (tablespace != NULL)
{
char *escape_tablespace;
escape_tablespace = PQescapeIdentifier(con, tablespace,
strlen(tablespace));
snprintf(opts + strlen(opts), 256 - strlen(opts),
" tablespace %s", escape_tablespace);
PQfreemem(escape_tablespace);
}
snprintf(buffer, 256, "create%s table %s(%s)%s",
unlogged_tables ? " unlogged" : "",
ddl->table, ddl->cols, opts);
executeStatement(con, buffer);
}
executeStatement(con, "begin");
for (i = 0; i < nbranches * scale; i++)
{
snprintf(sql, 256, "insert into pgbench_branches values(%d,0,0,0,0,0,0)", i + 1);
executeStatement(con, sql);
}
for (i = 0; i < ntellers * scale; i++)
{
snprintf(sql, 256, "insert into pgbench_tellers values (%d,%d,0,0,0,0,0,0)",
i + 1, i / ntellers + 1);
executeStatement(con, sql);
}
executeStatement(con, "commit");
/*
* fill the pgbench_accounts table with some data
*/
fprintf(stderr, "creating tables...\n");
executeStatement(con, "begin");
executeStatement(con, "truncate pgbench_accounts");
res = PQexec(con, "copy pgbench_accounts from stdin");
if (PQresultStatus(res) != PGRES_COPY_IN)
{
fprintf(stderr, "%s", PQerrorMessage(con));
exit(1);
}
PQclear(res);
for (i = 0; i < naccounts * scale; i++)
{
int j = i + 1;
snprintf(sql, 256, "%d\t%d\t%d\t \t%d\t \t%d\t \n", j, i / naccounts + 1, 0,0,0);
if (PQputline(con, sql))
{
fprintf(stderr, "PQputline failed\n");
exit(1);
}
if (j % 100000 == 0)
fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
j, naccounts * scale,
j * 100 / (naccounts * scale));
}
if (PQputline(con, "\\.\n"))
{
fprintf(stderr, "very last PQputline failed\n");
exit(1);
}
if (PQendcopy(con))
{
fprintf(stderr, "PQendcopy failed\n");
exit(1);
}
executeStatement(con, "commit");
/* vacuum */
if (!is_no_vacuum)
{
fprintf(stderr, "vacuum...\n");
executeStatement(con, "vacuum analyze pgbench_branches");
executeStatement(con, "vacuum analyze pgbench_tellers");
executeStatement(con, "vacuum analyze pgbench_accounts");
executeStatement(con, "vacuum analyze pgbench_history");
}
/*
* create indexes
*/
fprintf(stderr, "set primary keys...\n");
for (i = 0; i < lengthof(DDLAFTERs); i++)
{
char buffer[256];
strncpy(buffer, DDLAFTERs[i], 256);
if (index_tablespace != NULL)
{
char *escape_tablespace;
escape_tablespace = PQescapeIdentifier(con, index_tablespace,
strlen(index_tablespace));
snprintf(buffer + strlen(buffer), 256 - strlen(buffer),
" using index tablespace %s", escape_tablespace);
PQfreemem(escape_tablespace);
}
executeStatement(con, buffer);
}
/*
* create foreign keys
*/
if (foreign_keys)
{
fprintf(stderr, "set foreign keys...\n");
for (i = 0; i < lengthof(DDLKEYs); i++)
{
executeStatement(con, DDLKEYs[i]);
}
}
fprintf(stderr, "done.\n");
PQfinish(con);
}
/*
* Parse the raw sql and replace :param to $n.
*/
static bool
parseQuery(Command *cmd, const char *raw_sql)
{
char *sql,
*p;
sql = xstrdup(raw_sql);
cmd->argc = 1;
p = sql;
while ((p = strchr(p, ':')) != NULL)
{
char var[12];
char *name;
int eaten;
name = parseVariable(p, &eaten);
if (name == NULL)
{
while (*p == ':')
{
p++;
}
continue;
}
if (cmd->argc >= MAX_ARGS)
{
fprintf(stderr, "statement has too many arguments (maximum is %d): %s\n", MAX_ARGS - 1, raw_sql);
return false;
}
sprintf(var, "$%d", cmd->argc);
p = replaceVariable(&sql, p, eaten, var);
cmd->argv[cmd->argc] = name;
cmd->argc++;
}
cmd->argv[0] = sql;
return true;
}
/* Parse a command; return a Command struct, or NULL if it's a comment */
static Command *
process_commands(char *buf)
{
const char delim[] = " \f\n\r\t\v";
Command *my_commands;
int j;
char *p,
*tok;
/* Make the string buf end at the next newline */
if ((p = strchr(buf, '\n')) != NULL)
*p = '\0';
/* Skip leading whitespace */
p = buf;
while (isspace((unsigned char) *p))
p++;
/* If the line is empty or actually a comment, we're done */
if (*p == '\0' || strncmp(p, "--", 2) == 0)
return NULL;
/* Allocate and initialize Command structure */
my_commands = (Command *) xmalloc(sizeof(Command));
my_commands->line = xstrdup(buf);
my_commands->command_num = num_commands++;
my_commands->type = 0; /* until set */
my_commands->argc = 0;
if (*p == '\\')
{
my_commands->type = META_COMMAND;
j = 0;
tok = strtok(++p, delim);
while (tok != NULL)
{
my_commands->argv[j++] = xstrdup(tok);
my_commands->argc++;
tok = strtok(NULL, delim);
}
if (pg_strcasecmp(my_commands->argv[0], "setrandom") == 0)
{
if (my_commands->argc < 4)
{
fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
exit(1);
}
for (j = 4; j < my_commands->argc; j++)
fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
my_commands->argv[0], my_commands->argv[j]);
}
else if (pg_strcasecmp(my_commands->argv[0], "set") == 0)
{
if (my_commands->argc < 3)
{
fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
exit(1);
}
for (j = my_commands->argc < 5 ? 3 : 5; j < my_commands->argc; j++)
fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
my_commands->argv[0], my_commands->argv[j]);
}
else if (pg_strcasecmp(my_commands->argv[0], "sleep") == 0)
{
if (my_commands->argc < 2)
{
fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
exit(1);
}
/*
* Split argument into number and unit to allow "sleep 1ms" etc.
* We don't have to terminate the number argument with null
* because it will be parsed with atoi, which ignores trailing
* non-digit characters.
*/
if (my_commands->argv[1][0] != ':')
{
char *c = my_commands->argv[1];
while (isdigit((unsigned char) *c))
c++;
if (*c)
{
my_commands->argv[2] = c;
if (my_commands->argc < 3)
my_commands->argc = 3;
}
}
if (my_commands->argc >= 3)
{
if (pg_strcasecmp(my_commands->argv[2], "us") != 0 &&
pg_strcasecmp(my_commands->argv[2], "ms") != 0 &&
pg_strcasecmp(my_commands->argv[2], "s") != 0)
{
fprintf(stderr, "%s: unknown time unit '%s' - must be us, ms or s\n",
my_commands->argv[0], my_commands->argv[2]);
exit(1);
}
}
for (j = 3; j < my_commands->argc; j++)
fprintf(stderr, "%s: extra argument \"%s\" ignored\n",
my_commands->argv[0], my_commands->argv[j]);
}
else if (pg_strcasecmp(my_commands->argv[0], "setshell") == 0)
{
if (my_commands->argc < 3)
{
fprintf(stderr, "%s: missing argument\n", my_commands->argv[0]);
exit(1);
}
}
else if (pg_strcasecmp(my_commands->argv[0], "shell") == 0)
{
if (my_commands->argc < 1)
{
fprintf(stderr, "%s: missing command\n", my_commands->argv[0]);
exit(1);
}
}
else
{
fprintf(stderr, "Invalid command %s\n", my_commands->argv[0]);
exit(1);
}
}
else
{
my_commands->type = SQL_COMMAND;
switch (querymode)
{
case QUERY_SIMPLE:
my_commands->argv[0] = xstrdup(p);
my_commands->argc++;
break;
case QUERY_EXTENDED:
case QUERY_PREPARED:
if (!parseQuery(my_commands, p))
exit(1);
break;
default:
exit(1);
}
}
return my_commands;
}
static int
process_file(char *filename)
{
#define COMMANDS_ALLOC_NUM 128
Command **my_commands;
FILE *fd;
int lineno;
char buf[BUFSIZ];
int alloc_num;
if (num_files >= MAX_FILES)
{
fprintf(stderr, "Up to only %d SQL files are allowed\n", MAX_FILES);
exit(1);
}
alloc_num = COMMANDS_ALLOC_NUM;
my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);
if (strcmp(filename, "-") == 0)
fd = stdin;
else if ((fd = fopen(filename, "r")) == NULL)
{
fprintf(stderr, "%s: %s\n", filename, strerror(errno));
return false;
}
lineno = 0;
while (fgets(buf, sizeof(buf), fd) != NULL)
{
Command *command;
command = process_commands(buf);
if (command == NULL)
continue;
my_commands[lineno] = command;
lineno++;
if (lineno >= alloc_num)
{
alloc_num += COMMANDS_ALLOC_NUM;
my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
}
}
fclose(fd);
my_commands[lineno] = NULL;
sql_files[num_files++] = my_commands;
return true;
}
static Command **
process_builtin(char *tb)
{
#define COMMANDS_ALLOC_NUM 128
Command **my_commands;
int lineno;
char buf[BUFSIZ];
int alloc_num;
alloc_num = COMMANDS_ALLOC_NUM;
my_commands = (Command **) xmalloc(sizeof(Command *) * alloc_num);
lineno = 0;
for (;;)
{
char *p;
Command *command;
p = buf;
while (*tb && *tb != '\n')
*p++ = *tb++;
if (*tb == '\0')
break;
if (*tb == '\n')
tb++;
*p = '\0';
command = process_commands(buf);
if (command == NULL)
continue;
my_commands[lineno] = command;
lineno++;
if (lineno >= alloc_num)
{
alloc_num += COMMANDS_ALLOC_NUM;
my_commands = xrealloc(my_commands, sizeof(Command *) * alloc_num);
}
}
my_commands[lineno] = NULL;
return my_commands;
}
/* print out results */
static void
printResults(int ttype, int normal_xacts, int nclients,
TState *threads, int nthreads,
instr_time total_time, instr_time conn_total_time)
{
double time_include,
tps_include,
tps_exclude;
char *s;
time_include = INSTR_TIME_GET_DOUBLE(total_time);
tps_include = normal_xacts / time_include;
tps_exclude = normal_xacts / (time_include -
(INSTR_TIME_GET_DOUBLE(conn_total_time) / nthreads));
if (ttype == 0)
s = "TPC-B (sort of)";
else if (ttype == 2)
s = "Update only pgbench_accounts";
else if (ttype == 1)
s = "SELECT only";
else
s = "Custom query";
printf("transaction type: %s\n", s);
printf("scaling factor: %d\n", scale);
printf("query mode: %s\n", QUERYMODE[querymode]);
printf("number of clients: %d\n", nclients);
printf("number of threads: %d\n", nthreads);
if (duration <= 0)
{
printf("number of transactions per client: %d\n", nxacts);
printf("number of transactions actually processed: %d/%d\n",
normal_xacts, nxacts * nclients);
}
else
{
printf("duration: %d s\n", duration);
printf("number of transactions actually processed: %d\n",
normal_xacts);
}
printf("tps = %f (including connections establishing)\n", tps_include);
printf("tps = %f (excluding connections establishing)\n", tps_exclude);
/* Report per-command latencies */
if (is_latencies)
{
int i;
for (i = 0; i < num_files; i++)
{
Command **commands;
if (num_files > 1)
printf("statement latencies in milliseconds, file %d:\n", i + 1);
else
printf("statement latencies in milliseconds:\n");
for (commands = sql_files[i]; *commands != NULL; commands++)
{
Command *command = *commands;
int cnum = command->command_num;
double total_time;
instr_time total_exec_elapsed;
int total_exec_count;
int t;
/* Accumulate per-thread data for command */
INSTR_TIME_SET_ZERO(total_exec_elapsed);
total_exec_count = 0;
for (t = 0; t < nthreads; t++)
{
TState *thread = &threads[t];
INSTR_TIME_ADD(total_exec_elapsed,
thread->exec_elapsed[cnum]);
total_exec_count += thread->exec_count[cnum];
}
if (total_exec_count > 0)
total_time = INSTR_TIME_GET_MILLISEC(total_exec_elapsed) / (double) total_exec_count;
else
total_time = 0.0;
printf("\t%f\t%s\n", total_time, command->line);
}
}
}
}
int
main(int argc, char **argv)
{
int c;
int nclients = 1; /* default number of simulated clients */
int nthreads = 1; /* default number of threads */
int is_init_mode = 0; /* initialize mode? */
int is_no_vacuum = 0; /* no vacuum at all before testing? */
int do_vacuum_accounts = 0; /* do vacuum accounts before testing? */
int ttype = 0; /* transaction type. 0: TPC-B, 1: SELECT only,
* 2: skip update of branches and tellers */
int optindex;
char *filename = NULL;
bool scale_given = false;
CState *state; /* status of clients */
TState *threads; /* array of thread */
instr_time start_time; /* start up time */
instr_time total_time;
instr_time conn_total_time;
int total_xacts;
int i;
static struct option long_options[] = {
{"foreign-keys", no_argument, &foreign_keys, 1},
{"index-tablespace", required_argument, NULL, 3},
{"tablespace", required_argument, NULL, 2},
{"unlogged-tables", no_argument, &unlogged_tables, 1},
{NULL, 0, NULL, 0}
};
#ifdef HAVE_GETRLIMIT
struct rlimit rlim;
#endif
PGconn *con;
PGresult *res;
char *env;
char val[64];
progname = get_progname(argv[0]);
if (argc > 1)
{
if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
{
usage();
exit(0);
}
if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
{
puts("pgbench (PostgreSQL) " PG_VERSION);
exit(0);
}
}
#ifdef WIN32
/* stderr is buffered on Win32. */
setvbuf(stderr, NULL, _IONBF, 0);
#endif
if ((env = getenv("PGHOST")) != NULL && *env != '\0')
pghost = env;
if ((env = getenv("PGPORT")) != NULL && *env != '\0')
pgport = env;
else if ((env = getenv("PGUSER")) != NULL && *env != '\0')
login = env;
state = (CState *) xmalloc(sizeof(CState));
memset(state, 0, sizeof(CState));
while ((c = getopt_long(argc, argv, "ih:nvp:dSNc:j:Crs:t:T:U:lf:D:F:M:", long_options, &optindex)) != -1)
{
switch (c)
{
case 'i':
is_init_mode++;
break;
case 'h':
pghost = optarg;
break;
case 'n':
is_no_vacuum++;
break;
case 'v':
do_vacuum_accounts++;
break;
case 'p':
pgport = optarg;
break;
case 'd':
debug++;
break;
case 'S':
ttype = 1;
break;
case 'N':
ttype = 2;
break;
case 'c':
nclients = atoi(optarg);
if (nclients <= 0 || nclients > MAXCLIENTS)
{
fprintf(stderr, "invalid number of clients: %d\n", nclients);
exit(1);
}
#ifdef HAVE_GETRLIMIT
#ifdef RLIMIT_NOFILE /* most platforms use RLIMIT_NOFILE */
if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
#else /* but BSD doesn't ... */
if (getrlimit(RLIMIT_OFILE, &rlim) == -1)
#endif /* RLIMIT_NOFILE */
{
fprintf(stderr, "getrlimit failed: %s\n", strerror(errno));
exit(1);
}
if (rlim.rlim_cur <= (nclients + 2))
{
fprintf(stderr, "You need at least %d open files but you are only allowed to use %ld.\n", nclients + 2, (long) rlim.rlim_cur);
fprintf(stderr, "Use limit/ulimit to increase the limit before using pgbench.\n");
exit(1);
}
#endif /* HAVE_GETRLIMIT */
break;
case 'j': /* jobs */
nthreads = atoi(optarg);
if (nthreads <= 0)
{
fprintf(stderr, "invalid number of threads: %d\n", nthreads);
exit(1);
}
break;
case 'C':
is_connect = true;
break;
case 'r':
is_latencies = true;
break;
case 's':
scale_given = true;
scale = atoi(optarg);
if (scale <= 0)
{
fprintf(stderr, "invalid scaling factor: %d\n", scale);
exit(1);
}
break;
case 't':
if (duration > 0)
{
fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
exit(1);
}
nxacts = atoi(optarg);
if (nxacts <= 0)
{
fprintf(stderr, "invalid number of transactions: %d\n", nxacts);
exit(1);
}
break;
case 'T':
if (nxacts > 0)
{
fprintf(stderr, "specify either a number of transactions (-t) or a duration (-T), not both.\n");
exit(1);
}
duration = atoi(optarg);
if (duration <= 0)
{
fprintf(stderr, "invalid duration: %d\n", duration);
exit(1);
}
break;
case 'U':
login = optarg;
break;
case 'l':
use_log = true;
break;
case 'f':
ttype = 3;
filename = optarg;
if (process_file(filename) == false || *sql_files[num_files - 1] == NULL)
exit(1);
break;
case 'D':
{
char *p;
if ((p = strchr(optarg, '=')) == NULL || p == optarg || *(p + 1) == '\0')
{
fprintf(stderr, "invalid variable definition: %s\n", optarg);
exit(1);
}
*p++ = '\0';
if (!putVariable(&state[0], "option", optarg, p))
exit(1);
}
break;
case 'F':
fillfactor = atoi(optarg);
if ((fillfactor < 10) || (fillfactor > 100))
{
fprintf(stderr, "invalid fillfactor: %d\n", fillfactor);
exit(1);
}
break;
case 'M':
if (num_files > 0)
{
fprintf(stderr, "query mode (-M) should be specifiled before transaction scripts (-f)\n");
exit(1);
}
for (querymode = 0; querymode < NUM_QUERYMODE; querymode++)
if (strcmp(optarg, QUERYMODE[querymode]) == 0)
break;
if (querymode >= NUM_QUERYMODE)
{
fprintf(stderr, "invalid query mode (-M): %s\n", optarg);
exit(1);
}
break;
case 0:
/* This covers long options which take no argument. */
break;
case 2: /* tablespace */
tablespace = optarg;
break;
case 3: /* index-tablespace */
index_tablespace = optarg;
break;
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
exit(1);
break;
}
}
if (argc > optind)
dbName = argv[optind];
else
{
if ((env = getenv("PGDATABASE")) != NULL && *env != '\0')
dbName = env;
else if (login != NULL && *login != '\0')
dbName = login;
else
dbName = "";
}
if (is_init_mode)
{
init(is_no_vacuum);
exit(0);
}
/* Use DEFAULT_NXACTS if neither nxacts nor duration is specified. */
if (nxacts <= 0 && duration <= 0)
nxacts = DEFAULT_NXACTS;
if (nclients % nthreads != 0)
{
fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n", nclients, nthreads);
exit(1);
}
/*
* is_latencies only works with multiple threads in thread-based
* implementations, not fork-based ones, because it supposes that the
* parent can see changes made to the per-thread execution stats by child
* threads. It seems useful enough to accept despite this limitation, but
* perhaps we should FIXME someday (by passing the stats data back up
* through the parent-to-child pipes).
*/
#ifndef ENABLE_THREAD_SAFETY
if (is_latencies && nthreads > 1)
{
fprintf(stderr, "-r does not work with -j larger than 1 on this platform.\n");
exit(1);
}
#endif
/*
* save main process id in the global variable because process id will be
* changed after fork.
*/
main_pid = (int) getpid();
if (nclients > 1)
{
state = (CState *) xrealloc(state, sizeof(CState) * nclients);
memset(state + 1, 0, sizeof(CState) * (nclients - 1));
/* copy any -D switch values to all clients */
for (i = 1; i < nclients; i++)
{
int j;
state[i].id = i;
for (j = 0; j < state[0].nvariables; j++)
{
if (!putVariable(&state[i], "startup", state[0].variables[j].name, state[0].variables[j].value))
exit(1);
}
}
}
if (debug)
{
if (duration <= 0)
printf("pghost: %s pgport: %s nclients: %d nxacts: %d dbName: %s\n",
pghost, pgport, nclients, nxacts, dbName);
else
printf("pghost: %s pgport: %s nclients: %d duration: %d dbName: %s\n",
pghost, pgport, nclients, duration, dbName);
}
/* opening connection... */
con = doConnect();
if (con == NULL)
exit(1);
if (PQstatus(con) == CONNECTION_BAD)
{
fprintf(stderr, "Connection to database '%s' failed.\n", dbName);
fprintf(stderr, "%s", PQerrorMessage(con));
exit(1);
}
if (ttype != 3)
{
/*
* get the scaling factor that should be same as count(*) from
* pgbench_branches if this is not a custom query
*/
res = PQexec(con, "select count(*) from pgbench_branches");
if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
fprintf(stderr, "%s", PQerrorMessage(con));
exit(1);
}
scale = atoi(PQgetvalue(res, 0, 0));
if (scale < 0)
{
fprintf(stderr, "count(*) from pgbench_branches invalid (%d)\n", scale);
exit(1);
}
PQclear(res);
/* warn if we override user-given -s switch */
if (scale_given)
fprintf(stderr,
"Scale option ignored, using pgbench_branches table count = %d\n",
scale);
}
/*
* :scale variables normally get -s or database scale, but don't override
* an explicit -D switch
*/
if (getVariable(&state[0], "scale") == NULL)
{
snprintf(val, sizeof(val), "%d", scale);
for (i = 0; i < nclients; i++)
{
if (!putVariable(&state[i], "startup", "scale", val))
exit(1);
}
}
if (!is_no_vacuum)
{
fprintf(stderr, "starting vacuum...");
executeStatement(con, "vacuum pgbench_branches");
executeStatement(con, "vacuum pgbench_tellers");
executeStatement(con, "truncate pgbench_history");
fprintf(stderr, "end.\n");
if (do_vacuum_accounts)
{
fprintf(stderr, "starting vacuum pgbench_accounts...");
executeStatement(con, "vacuum analyze pgbench_accounts");
fprintf(stderr, "end.\n");
}
}
PQfinish(con);
/* set random seed */
INSTR_TIME_SET_CURRENT(start_time);
srandom((unsigned int) INSTR_TIME_GET_MICROSEC(start_time));
/* process builtin SQL scripts */
switch (ttype)
{
case 0:
sql_files[0] = process_builtin(tpc_b);
num_files = 1;
break;
case 1:
sql_files[0] = process_builtin(select_only);
num_files = 1;
break;
case 2:
sql_files[0] = process_builtin(simple_update);
num_files = 1;
break;
default:
break;
}
/* set up thread data structures */
threads = (TState *) xmalloc(sizeof(TState) * nthreads);
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
thread->tid = i;
thread->state = &state[nclients / nthreads * i];
thread->nstate = nclients / nthreads;
thread->random_state[0] = random();
thread->random_state[1] = random();
thread->random_state[2] = random();
if (is_latencies)
{
/* Reserve memory for the thread to store per-command latencies */
int t;
thread->exec_elapsed = (instr_time *)
xmalloc(sizeof(instr_time) * num_commands);
thread->exec_count = (int *)
xmalloc(sizeof(int) * num_commands);
for (t = 0; t < num_commands; t++)
{
INSTR_TIME_SET_ZERO(thread->exec_elapsed[t]);
thread->exec_count[t] = 0;
}
}
else
{
thread->exec_elapsed = NULL;
thread->exec_count = NULL;
}
}
/* get start up time */
INSTR_TIME_SET_CURRENT(start_time);
/* set alarm if duration is specified. */
if (duration > 0)
setalarm(duration);
/* start threads */
for (i = 0; i < nthreads; i++)
{
TState *thread = &threads[i];
INSTR_TIME_SET_CURRENT(thread->start_time);
/* the first thread (i = 0) is executed by main thread */
if (i > 0)
{
int err = pthread_create(&thread->thread, NULL, threadRun, thread);
if (err != 0 || thread->thread == INVALID_THREAD)
{
fprintf(stderr, "cannot create thread: %s\n", strerror(err));
exit(1);
}
}
else
{
thread->thread = INVALID_THREAD;
}
}
/* wait for threads and accumulate results */
total_xacts = 0;
INSTR_TIME_SET_ZERO(conn_total_time);
for (i = 0; i < nthreads; i++)
{
void *ret = NULL;
if (threads[i].thread == INVALID_THREAD)
ret = threadRun(&threads[i]);
else
pthread_join(threads[i].thread, &ret);
if (ret != NULL)
{
TResult *r = (TResult *) ret;
total_xacts += r->xacts;
INSTR_TIME_ADD(conn_total_time, r->conn_time);
free(ret);
}
}
disconnect_all(state, nclients);
/* get end time */
INSTR_TIME_SET_CURRENT(total_time);
INSTR_TIME_SUBTRACT(total_time, start_time);
printResults(ttype, total_xacts, nclients, threads, nthreads,
total_time, conn_total_time);
return 0;
}
static void *
threadRun(void *arg)
{
TState *thread = (TState *) arg;
CState *state = thread->state;
TResult *result;
FILE *logfile = NULL; /* per-thread log file */
instr_time start,
end;
int nstate = thread->nstate;
int remains = nstate; /* number of remaining clients */
int i;
result = xmalloc(sizeof(TResult));
INSTR_TIME_SET_ZERO(result->conn_time);
/* open log file if requested */
if (use_log)
{
char logpath[64];
if (thread->tid == 0)
snprintf(logpath, sizeof(logpath), "pgbench_log.%d", main_pid);
else
snprintf(logpath, sizeof(logpath), "pgbench_log.%d.%d", main_pid, thread->tid);
logfile = fopen(logpath, "w");
if (logfile == NULL)
{
fprintf(stderr, "Couldn't open logfile \"%s\": %s", logpath, strerror(errno));
goto done;
}
}
if (!is_connect)
{
/* make connections to the database */
for (i = 0; i < nstate; i++)
{
if ((state[i].con = doConnect()) == NULL)
goto done;
}
}
/* time after thread and connections set up */
INSTR_TIME_SET_CURRENT(result->conn_time);
INSTR_TIME_SUBTRACT(result->conn_time, thread->start_time);
/* send start up queries in async manner */
for (i = 0; i < nstate; i++)
{
CState *st = &state[i];
Command **commands = sql_files[st->use_file];
int prev_ecnt = st->ecnt;
st->use_file = getrand(thread, 0, num_files - 1);
if (!doCustom(thread, st, &result->conn_time, logfile))
remains--; /* I've aborted */
if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
{
fprintf(stderr, "Client %d aborted in state %d. Execution meta-command failed.\n", i, st->state);
remains--; /* I've aborted */
PQfinish(st->con);
st->con = NULL;
}
}
while (remains > 0)
{
fd_set input_mask;
int maxsock; /* max socket number to be waited */
int64 now_usec = 0;
int64 min_usec;
FD_ZERO(&input_mask);
maxsock = -1;
min_usec = INT64_MAX;
for (i = 0; i < nstate; i++)
{
CState *st = &state[i];
Command **commands = sql_files[st->use_file];
int sock;
if (st->sleeping)
{
int this_usec;
if (min_usec == INT64_MAX)
{
instr_time now;
INSTR_TIME_SET_CURRENT(now);
now_usec = INSTR_TIME_GET_MICROSEC(now);
}
this_usec = st->until - now_usec;
if (min_usec > this_usec)
min_usec = this_usec;
}
else if (st->con == NULL)
{
continue;
}
else if (commands[st->state]->type == META_COMMAND)
{
min_usec = 0; /* the connection is ready to run */
break;
}
sock = PQsocket(st->con);
if (sock < 0)
{
fprintf(stderr, "bad socket: %s\n", strerror(errno));
goto done;
}
FD_SET(sock, &input_mask);
if (maxsock < sock)
maxsock = sock;
}
if (min_usec > 0 && maxsock != -1)
{
int nsocks; /* return from select(2) */
if (min_usec != INT64_MAX)
{
struct timeval timeout;
timeout.tv_sec = min_usec / 1000000;
timeout.tv_usec = min_usec % 1000000;
nsocks = select(maxsock + 1, &input_mask, NULL, NULL, &timeout);
}
else
nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
if (nsocks < 0)
{
if (errno == EINTR)
continue;
/* must be something wrong */
fprintf(stderr, "select failed: %s\n", strerror(errno));
goto done;
}
}
/* ok, backend returns reply */
for (i = 0; i < nstate; i++)
{
CState *st = &state[i];
Command **commands = sql_files[st->use_file];
int prev_ecnt = st->ecnt;
if (st->con && (FD_ISSET(PQsocket(st->con), &input_mask)
|| commands[st->state]->type == META_COMMAND))
{
if (!doCustom(thread, st, &result->conn_time, logfile))
remains--; /* I've aborted */
}
if (st->ecnt > prev_ecnt && commands[st->state]->type == META_COMMAND)
{
fprintf(stderr, "Client %d aborted in state %d. Execution of meta-command failed.\n", i, st->state);
remains--; /* I've aborted */
PQfinish(st->con);
st->con = NULL;
}
}
}
done:
INSTR_TIME_SET_CURRENT(start);
disconnect_all(state, nstate);
result->xacts = 0;
for (i = 0; i < nstate; i++)
result->xacts += state[i].cnt;
INSTR_TIME_SET_CURRENT(end);
INSTR_TIME_ACCUM_DIFF(result->conn_time, end, start);
if (logfile)
fclose(logfile);
return result;
}
/*
* Support for duration option: set timer_exceeded after so many seconds.
*/
#ifndef WIN32
static void
handle_sig_alarm(SIGNAL_ARGS)
{
timer_exceeded = true;
}
static void
setalarm(int seconds)
{
pqsignal(SIGALRM, handle_sig_alarm);
alarm(seconds);
}
#ifndef ENABLE_THREAD_SAFETY
/*
* implements pthread using fork.
*/
typedef struct fork_pthread
{
pid_t pid;
int pipes[2];
} fork_pthread;
static int
pthread_create(pthread_t *thread,
pthread_attr_t *attr,
void *(*start_routine) (void *),
void *arg)
{
fork_pthread *th;
void *ret;
th = (fork_pthread *) xmalloc(sizeof(fork_pthread));
if (pipe(th->pipes) < 0)
{
free(th);
return errno;
}
th->pid = fork();
if (th->pid == -1) /* error */
{
free(th);
return errno;
}
if (th->pid != 0) /* in parent process */
{
close(th->pipes[1]);
*thread = th;
return 0;
}
/* in child process */
close(th->pipes[0]);
/* set alarm again because the child does not inherit timers */
if (duration > 0)
setalarm(duration);
ret = start_routine(arg);
write(th->pipes[1], ret, sizeof(TResult));
close(th->pipes[1]);
free(th);
exit(0);
}
static int
pthread_join(pthread_t th, void **thread_return)
{
int status;
while (waitpid(th->pid, &status, 0) != th->pid)
{
if (errno != EINTR)
return errno;
}
if (thread_return != NULL)
{
/* assume result is TResult */
*thread_return = xmalloc(sizeof(TResult));
if (read(th->pipes[0], *thread_return, sizeof(TResult)) != sizeof(TResult))
{
free(*thread_return);
*thread_return = NULL;
}
}
close(th->pipes[0]);
free(th);
return 0;
}
#endif
#else /* WIN32 */
static VOID CALLBACK
win32_timer_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
{
timer_exceeded = true;
}
static void
setalarm(int seconds)
{
HANDLE queue;
HANDLE timer;
/* This function will be called at most once, so we can cheat a bit. */
queue = CreateTimerQueue();
if (seconds > ((DWORD) -1) / 1000 ||
!CreateTimerQueueTimer(&timer, queue,
win32_timer_callback, NULL, seconds * 1000, 0,
WT_EXECUTEINTIMERTHREAD | WT_EXECUTEONLYONCE))
{
fprintf(stderr, "Failed to set timer\n");
exit(1);
}
}
/* partial pthread implementation for Windows */
typedef struct win32_pthread
{
HANDLE handle;
void *(*routine) (void *);
void *arg;
void *result;
} win32_pthread;
static unsigned __stdcall
win32_pthread_run(void *arg)
{
win32_pthread *th = (win32_pthread *) arg;
th->result = th->routine(th->arg);
return 0;
}
static int
pthread_create(pthread_t *thread,
pthread_attr_t *attr,
void *(*start_routine) (void *),
void *arg)
{
int save_errno;
win32_pthread *th;
th = (win32_pthread *) xmalloc(sizeof(win32_pthread));
th->routine = start_routine;
th->arg = arg;
th->result = NULL;
th->handle = (HANDLE) _beginthreadex(NULL, 0, win32_pthread_run, th, 0, NULL);
if (th->handle == NULL)
{
save_errno = errno;
free(th);
return save_errno;
}
*thread = th;
return 0;
}
static int
pthread_join(pthread_t th, void **thread_return)
{
if (th == NULL || th->handle == NULL)
return errno = EINVAL;
if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
{
_dosmaperr(GetLastError());
return errno;
}
if (thread_return)
*thread_return = th->result;
CloseHandle(th->handle);
free(th);
return 0;
}
#endif /* WIN32 */
Hello, I looked into the patch and have some comments.
Given the limited time for this rather big patch, please excuse
that these comments cover only part of it. Others will follow in
a few days.
==== heaptuple.c
nocachegetattr(_with_len):
- att_getlength has to do strlen in the worst case, or VARSIZE_ANY,
either of which is heavier than a single comparison, so I recommend
adding an 'if (len)' guard around it, and passing NULL for len
to nocachegetattr_with_len in nocachegetattr.
heap_attr_get_length_and_check_equals:
- Size seems to be the conventional type for memory object lengths,
so it might be better to use Size instead of int32 as the type of
the *tup[12]_attr_len parameters.
- This function always returns false for attrnum <= 0 (whole-tuple
or some system-attribute comparisons) regardless of the real
result, which is a bit different from what the name suggests.
If you need to keep this optimization, the function should have
a name more specific to its purpose.
heap_delta_encode:
- There are some misleading variable names (like match_not_found),
some repetitions of similar codelets (att_align_pointer, pglz_out_tag),
misleadingly slight differences between the meanings of variables
with similar names (old_off and new_off and similar pairs),
and a somewhat tricky use of pglz_out_add and pglz_out_tag with length = 0.
Changes for better readability would be welcome.
==== heapam.c
fastgetattr_with_len
- Missing left paren on line 867 ('nocachegetattr_with_len(tup)...').
- Missing enclosing paren at heapam.c:879 (around len; style only).
- Allowing len = NULL would be good for better performance, as in
nocachegetattr.
fastgetattr
- I suppose the coding convention here is that the macro and the
alternative C code are expected to look similar. fastgetattr looks
quite different from the corresponding macro.
...
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Friday, December 07, 2012 2:28 PM Kyotaro HORIGUCHI wrote:
Hello, I looked into the patch and have some comments.
Thank you for reviewing the patch.
Given the limited time for this rather big patch, please excuse
that these comments cover only part of it. Others will follow in
a few days.
It's perfectly fine.
==== heaptuple.c
nocachegetattr(_with_len):
- att_getlength has to do strlen in the worst case, or VARSIZE_ANY,
either of which is heavier than a single comparison, so I recommend
adding an 'if (len)' guard around it, and passing NULL for len
to nocachegetattr_with_len in nocachegetattr.
Fixed.
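To show the fix concretely, this is the shape the guard takes in the attached
v5 patch (a condensed sketch of nocachegetattr_with_len, not the full hunk):
	/* Compute the length only when the caller asked for it, so plain
	 * nocachegetattr() callers do not pay for VARSIZE_ANY()/strlen(). */
	if (len)
		*len = att_getlength(att[attnum]->attlen, tp + off);
	return fetchatt(att[attnum], tp + off);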
heap_attr_get_length_and_check_equals:
- Size seems to be the conventional type for memory object lengths,
so it might be better to use Size instead of int32 as the type of
the *tup[12]_attr_len parameters.
Fixed.
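For reference, the exported prototype in the attached v5 patch now carries the
lengths as Size (condensed from the htup_details.h hunk):
	extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
					int attrnum, HeapTuple tup1, HeapTuple tup2,
					Size *tup1_attr_len, Size *tup2_attr_len);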
- This function always returns false for attrnum <= 0 (whole-tuple
or some system-attribute comparisons) regardless of the real
result, which is a bit different from what the name suggests.
If you need to keep this optimization, the function should have
a name more specific to its purpose.
The heap_attr_get_length_and_check_equals function takes over the role of heap_tuple_attr_equals,
which now simply calls it, so the attrnum <= 0 check is still required on behalf of heap_tuple_attr_equals.
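To make the relationship concrete, the attached v5 patch reduces the old
function in heapam.c to a thin wrapper that discards the lengths (condensed
excerpt):
	static bool
	heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
						   HeapTuple tup1, HeapTuple tup2)
	{
		Size		tup1_attr_len,
					tup2_attr_len;

		/* The lengths are computed but not needed by this caller. */
		return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
													 &tup1_attr_len, &tup2_attr_len);
	}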
heap_delta_encode:
- There are some misleading variable names (like match_not_found),
some repetitions of similar codelets (att_align_pointer, pglz_out_tag),
misleadingly slight differences between the meanings of variables
with similar names (old_off and new_off and similar pairs),
and a somewhat tricky use of pglz_out_add and pglz_out_tag with length = 0.
Changes for better readability would be welcome.
The variable names have been modified; please check them once more.
The repeated (att_align_pointer, pglz_out_tag) code is there to take care of padding only in case the values are equal.
pglz_out_add and pglz_out_tag are called with length = 0 purely for code readability.
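A short note on why the length = 0 calls are harmless: both macros in the
attached patch loop on the remaining length before emitting anything, so a
zero-length call writes no data bytes and no control bits. A minimal
stand-alone sketch of that shape (hypothetical helper, not part of the patch):
	/* A zero length falls straight through the loop: nothing is emitted. */
	static void
	emit_literals_sketch(unsigned char **bufp, const char **srcp, int len)
	{
		while (len-- > 0)
			*(*bufp)++ = (unsigned char) *(*srcp)++;
	}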
==== heapam.c
fastgetattr_with_len
- Missing left paren on line 867 ('nocachegetattr_with_len(tup)...').
- Missing enclosing paren at heapam.c:879 (around len; style only).
- Allowing len = NULL would be good for better performance, as in
nocachegetattr.
Fixed, except for len = NULL, because fastgetattr has been modified as per the comment below.
fastgetattr
- I suppose the coding convention here is that the macro and the
alternative C code are expected to look similar. fastgetattr looks
quite different from the corresponding macro.
Fixed.
Another change has also been made to handle a history size of 2 bytes, which is possible when the LZ macros are used for delta encoding.
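If that needs spelling out: a history tag costs 2-3 bytes plus a control bit,
so a match of only 1-2 bytes, which the column-wise delta encoding can
produce, is cheaper to emit as literal data. In the attached pg_lzcompress.h
hunk this check lives inside the modified pglz_out_tag macro; conceptually it
amounts to the following sketch (variable names follow the encoder, not a
verbatim excerpt):
	/* Sketch only: short matches become literals, longer ones a history tag. */
	if (match_len < 3)
	{
		const char *src = history + match_off;	/* pglz_out_add advances src */
		pglz_out_add(ctrlp, ctrlb, ctrl, bp, match_len, src);
	}
	else
		pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, history);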
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_lz_v5.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,392 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 513,534 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1015 ----
}
/*
+ * Check if the specified attribute's value is same in both given tuples.
+ * and outputs the length of the given attribute in both tuples.
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms an encoded data from old and new tuple with the modified columns
+ * using an algorithm similar to LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the encoded data using lz algorithm.
+ *
+ * Encode the bitmap [+padding] [+oid] as a new data. And loop for all
+ * attributes to find any modifications in the attributes.
+ *
+ * The unmodified data is encoded as a history tag to the output and the
+ * modifed data is encoded as new data to the output.
+ *
+ * If the encoded output data is less than 75% of original data,
+ * The output data is considered as encoded and proceed further.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum encoded data is of 75% of total size. The max tuple size is
+ * already validated as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check for output buffer is reached the result_max by advancing the
+ * buffer by the calculated aproximate length for the corresponding
+ * operation.
+ */
+ if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropiate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ /*
+ * Check for output buffer is reached the result_max by advancing
+ * the buffer by the calculated aproximate length for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t to the tuple t_hoff
+ * value, the bit map len needs to be added to match_off to get
+ * the actual start offfset from the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data presents in the old and new tuples then
+ * encode the data as it needs to copy from history tuple with len
+ * and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ /*
+ * Check for output buffer is reached the result_max by advancing
+ * the buffer by the calculated aproximate length for the
+ * corresponding operation.
+ */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * calculate the old tuple field start position, required to
+ * ignore if any alignmet is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+ /*
+ * calculate the new tuple field start position to check
+ * whether any padding is required or not because field
+ * alignment.
+ */
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ /*
+ * Checking for that is there any alignment difference between
+ * old and new tuple attributes.
+ */
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If the alignment difference is found between old and
+ * new tuples and the last attribute value of the new
+ * tuple is same as old tuple then write the encode as
+ * history data until the current match.
+ *
+ * If the last attribute value of new tuple is not same as
+ * old tuple then the matched data marking as history is
+ * already taken care.
+ */
+ if (is_attr_equals)
+ {
+ /*
+ * Check for output buffer is reached the result_max
+ * by advancing the buffer by the calculated
+ * aproximate length for the corresponding operation.
+ */
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data presents then copy it. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any left out old tuple data presents then copy it as history */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,849 **** heapgettup_pagemode(HeapScanDesc scan,
--- 845,898 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len(tup), (attnum), (tupleDesc), (len))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 860,866 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 909,916 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr(tup), (attnum), (tupleDesc))
)
:
(
***************
*** 2383,2389 **** simple_heap_insert(Relation relation, HeapTuple tup)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2433,2439 ----
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData * hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***************
*** 3212,3221 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3262,3273 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3282,3355 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3334,3344 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4447,4453 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4436,4442 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4456,4461 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4445,4461 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4465,4475 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4465,4505 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update is going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * LZ algorithm can hold only history offset in the range of 1 - 4095.
+ * so the delta encode is restricted for the tuples with length more
+ * than PGLZ_HISTORY_SIZE.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4496,4504 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4526,4537 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5274,5280 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5307,5316 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5289,5295 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5325,5331 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5349,5355 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5385,5391 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5368,5374 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5404,5410 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5393,5399 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5429,5435 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5456,5465 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5492,5520 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5474,5480 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5529,5535 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,714 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
--- 658,685 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old page's
! all visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new page's
! all visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,198 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * Split the process of backward/history reference as different chunks,
+ * if the given lenght is more than max match and repeats the process
+ * until the given length is processed.
+ *
+ * If the matched history length is less than 3 bytes then add it as a
+ * new data only during encoding instead of history reference. This occurs
+ * only while framing delta record for wal update operation.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ if (_mlen < 3) \
+ { \
+ (_byte) = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mlen,(_byte)); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit until the given input length.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _total_len = (_len); \
+ while (_total_len-- > 0) \
+ { \
+ pglz_out_literal(_ctrlp, _ctrlb, _ctrl, _buf, *(_byte)); \
+ (_byte) = (char *)(_byte) + 1; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 215,220 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
wal_update_changes_mod_lz_v6.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,392 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 513,534 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1015 ----
}
/*
+ * Check if the specified attribute's value is the same in both given tuples,
+ * and output the length of the given attribute in both tuples.
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms encoded data from the old and new tuples, covering the modified
+ * columns, using an algorithm similar to the LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the data encoded with the LZ algorithm.
+ *
+ * Encode the bitmap [+ padding] [+ oid] as new data, and loop over all
+ * attributes to find any modified attributes.
+ *
+ * Unmodified data is encoded as a history tag in the output, and
+ * modified data is encoded as new data in the output.
+ *
+ * If the encoded output data is less than 75% of the original data,
+ * the output is considered successfully encoded and we proceed further.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum encoded data size is 75% of the total size. The max tuple size is
+ * already validated as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer would reach result_max after advancing
+ * it by the approximate length calculated for the corresponding
+ * operation.
+ */
+ if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record; otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length calculated for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples then
+ * encode it as data to be copied from the history tuple, with length
+ * and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ /*
+ * Check whether the output buffer would reach result_max after
+ * advancing it by the approximate length calculated for the
+ * corresponding operation.
+ */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the old tuple field start position; this is required to
+ * skip any alignment padding that is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+ /*
+ * Calculate the new tuple field start position to check
+ * whether any padding is required because of field
+ * alignment.
+ */
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the last attribute value of the new
+ * tuple is the same as in the old tuple, then encode the
+ * data up to the current match as history data.
+ *
+ * If the last attribute value of the new tuple is not the
+ * same as in the old tuple, then marking the matched data
+ * as history has already been taken care of.
+ */
+ if (is_attr_equals)
+ {
+ /*
+ * Check whether the output buffer would reach result_max
+ * after advancing it by the approximate length calculated
+ * for the corresponding operation.
+ */
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data is present then copy it. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data is present then copy it as history */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,849 **** heapgettup_pagemode(HeapScanDesc scan,
--- 845,898 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 860,866 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 909,916 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
***************
*** 2383,2389 **** simple_heap_insert(Relation relation, HeapTuple tup)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2433,2439 ----
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData * hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***************
*** 2685,2691 **** simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
! true /* wait for commit */,
&hufd);
switch (result)
{
--- 2735,2741 ----
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
! true /* wait for commit */ ,
&hufd);
switch (result)
{
***************
*** 2742,2748 **** simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2792,2798 ----
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData * hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***************
*** 3212,3221 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3262,3273 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3282,3355 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3334,3344 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 3400,3406 **** simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
! true /* wait for commit */,
&hufd);
switch (result)
{
--- 3389,3395 ----
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
! true /* wait for commit */ ,
&hufd);
switch (result)
{
***************
*** 3487,3493 **** simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
HTSU_Result
heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, bool nowait,
! Buffer *buffer, HeapUpdateFailureData *hufd)
{
HTSU_Result result;
ItemPointer tid = &(tuple->t_self);
--- 3476,3482 ----
HTSU_Result
heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, bool nowait,
! Buffer *buffer, HeapUpdateFailureData * hufd)
{
HTSU_Result result;
ItemPointer tid = &(tuple->t_self);
***************
*** 4447,4453 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4436,4442 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4456,4461 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4445,4461 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4465,4475 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4465,4505 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can only hold history offsets in the range 1 - 4095,
+ * so delta encoding is not used for tuples whose length exceeds
+ * PGLZ_HISTORY_SIZE.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4496,4504 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4526,4537 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5274,5280 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5307,5316 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5289,5295 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5325,5331 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5349,5355 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5385,5391 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5368,5374 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5404,5410 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5393,5399 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5429,5435 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5456,5465 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5492,5520 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5474,5480 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5529,5535 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,735 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * Otherwise it contains the match length minus 3 and the
! * upper 4 bits of the offset. The next following byte
! * contains the lower 8 bits of the offset. If the length is
! * coded as 18, another extension tag byte tells how much
! * longer the match really was (0-255).
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! /*
! * Check for output buffer overrun, to ensure we don't clobber
! * memory in case of corrupt input. Note: we must advance dp
! * here to ensure the error is detected below the loop. We
! * don't simply put the elog inside the loop since that will
! * probably interfere with optimization.
! */
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source to
! * OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,206 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * Splits the backward/history reference into separate chunks if the given
+ * length is more than the max match, and repeats the process until the
+ * given length is processed.
+ *
+ * If the matched history length is less than 3 bytes then it is added as
+ * new data during encoding instead of a history reference. This occurs
+ * only while framing the delta record for a WAL update operation.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ if (_mlen < 3) \
+ { \
+ (_byte) = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mlen,(_byte)); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _mlen; \
+ int32 _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > 255 ? 255 : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_mlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _mlen); \
+ (_buf) += _mlen; \
+ (_byte) += _mlen; \
+ _total_len -= _mlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 223,228 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
Thank you.
heap_attr_get_length_and_check_equals:
..
- This function always returns false for attrnum <= 0 (whole-tuple
or some system-attribute comparisons) regardless of the real
result, which is a bit different from what the name suggests.
If you need to keep this optimization, the function should have
a name more specific to its purpose.
The heap_attr_get_length_and_check_equals function is similar to heap_tuple_attr_equals;
the attrnum <= 0 check is required for heap_tuple_attr_equals.
Sorry, you're right.
heap_delta_encode:
- Some misleading variable names (like match_not_found),
some repetitions of similar codelets (att_align_pointer, pglz_out_tag),
misleading slight differences in the meanings of similarly named
variables (old_off and new_off and the similar pairs),
and a bit tricky use of pglz_out_add and pglz_out_tag with length = 0.
These are welcome to be modified for better readability.
The variable names have been modified, please check them once.
The (att_align_pointer, pglz_out_tag) repetition is added to take care of padding only in case the values are equal.
Use of pglz_out_add and pglz_out_tag with length = 0 is done for code readability.
Oops! Sorry for the mistake. My point was about the bases for old_off
(or match_off) and dp, not new_off. It is not unnatural. Naming
had not been the problem and the function was fine as of the
last patch. I'd been confused by the asymmetry between match_off
going to pglz_out_tag and dp going to pglz_out_add.
Another change is also done to handle a history match of only 2 bytes, which is possible with the use of the LZ macros for delta encoding.
Good catch. This seems to have been a potential bug which does no
harm when called from pglz_compress.
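To make the constraint concrete, a minimal standalone sketch of the tag
layout (the function name here is hypothetical, not from either patch; it
simply mirrors the pglz_out_tag macro above): a 2-byte tag stores the match
length as (len - 3) in four bits, a 3-byte tag adds an extension byte for
lengths above 17, and a match shorter than 3 bytes cannot be represented as
a tag at all, which is why the macro falls back to emitting such a match as
literal/new data.

/*
 * Sketch only: encode one PG-LZ history reference tag, as the patch's
 * pglz_out_tag macro does. 'off' must be 1..4095 and 'len' must be
 * 3..273; matches shorter than 3 bytes have to be emitted as literals.
 */
static unsigned char *
encode_history_tag(unsigned char *buf, int len, int off)
{
	if (len > 17)
	{
		buf[0] = (unsigned char) (((off & 0xf00) >> 4) | 0x0f);
		buf[1] = (unsigned char) (off & 0xff);
		buf[2] = (unsigned char) (len - 18);
		return buf + 3;
	}
	buf[0] = (unsigned char) (((off & 0xf00) >> 4) | (len - 3));
	buf[1] = (unsigned char) (off & 0xff);
	return buf + 2;
}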
==========
Looking into wal_update_changes_mod_lz_v6.patch, I understand
that this patch experimentally adds literal data segments which
can hold more than a single byte to the PG-LZ algorithm. According to
pglz_find_match, memcmp is slower than 'while(*s && *s == *d)' if
len < 16, and I suppose that is probably true at least for 4-byte
data. This also applies on the encoding side. If this mod
does no harm to performance, I want to see it applied also to
pglz_compress.
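To illustrate the comparison idiom being discussed, a sketch under the
assumption stated above (that a plain byte loop beats memcmp for short
lengths); the function name is hypothetical and appears in neither patch:

#include <string.h>

/*
 * Sketch: compare two buffers of 'len' bytes, using a simple byte loop
 * for short runs (the style pglz_find_match relies on) and memcmp for
 * longer ones.
 */
static int
bytes_equal(const char *s, const char *d, int len)
{
	if (len < 16)
	{
		while (len-- > 0)
		{
			if (*s++ != *d++)
				return 0;
		}
		return 1;
	}
	return memcmp(s, d, len) == 0;
}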
By the way, the comment at pg_lzcompress.c:690 seems to differ
quite a bit from what the code does.
regards,
*1: http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C38285495B0@szxeml509-mbx
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Monday, December 10, 2012 2:41 PM Kyotaro HORIGUCHI wrote:
Thank you.
heap_attr_get_length_and_check_equals:
..
- This function always returns false for attrnum <= 0 (whole-tuple
or some system-attribute comparisons) regardless of the real
result, which is a bit different from what the name suggests.
If you need to keep this optimization, the function should have
a name more specific to its purpose.
The heap_attr_get_length_and_check_equals function is similar to
heap_tuple_attr_equals;
the attrnum <= 0 check is required for heap_tuple_attr_equals.
Sorry, you're right.
heap_delta_encode:
- Some misleading variable names (like match_not_found),
some repetitions of similar codelets (att_align_pointer, pglz_out_tag),
misleading slight differences in the meanings of similarly named
variables (old_off and new_off and the similar pairs),
and a bit tricky use of pglz_out_add and pglz_out_tag with length = 0.
These are welcome to be modified for better readability.
The variable names have been modified, please check them once.
The (att_align_pointer, pglz_out_tag) repetition is added to take
care of padding only in case the values are equal.
Use of pglz_out_add and pglz_out_tag with length = 0 is done for
code readability.
Oops! Sorry for the mistake. My point was about the bases for old_off
(or match_off) and dp, not new_off. It is not unnatural. Naming
had not been the problem and the function was fine as of the
last patch.
I think the new naming I have done is more meaningful; do you think I should
revert to the previous patch's names?
I'd been confused by the asymmetry between match_off
going to pglz_out_tag and dp going to pglz_out_add.
If we look at the usage of pglz_out_tag and pglz_out_literal in pglz_compress(),
it is the same as what I have used.
Another change is also done to handle a history match of only 2 bytes,
which is possible with the use of the LZ macros for delta encoding.
Good catch. This seems to have been a potential bug which does no
harm when called from pglz_compress.
==========
Looking into wal_update_changes_mod_lz_v6.patch, I understand
that this patch experimentally adds literal data segments which
can hold more than a single byte to the PG-LZ algorithm. According to
pglz_find_match, memcmp is slower than 'while(*s && *s == *d)' if
len < 16, and I suppose that is probably true at least for 4-byte
data. This also applies on the encoding side. If this mod
does no harm to performance, I want to see it applied also to
pglz_compress.
Where in pglz_compress() do you want to see similar usage?
Or do you want to see such use in the function
heap_attr_get_length_and_check_equals(), where it compares two attributes?
By the way, the comment at pg_lzcompress.c:690 seems to differ
quite a bit from what the code does.
I shall fix this.
With Regards,
Amit Kapila.
Hello, I took the performance figures for this patch.
CentOS6.3/Core i7
wal_level = archive, checkpoint_segments = 30 / 5min
A. Vanilla pgbench, postgres is HEAD
B. Vanilla pgbench, postgres is with this patch (wal_update_changes_lz_v5)
C. Modified pgbench(Long text), postgres is HEAD
D. Modified pgbench(Long text), postgres is with this patch
Run by doing pgbench -s 10 -i, then pgbench -c 20 -T 2400
#trans/s WAL MB WAL kB/tran
1A 437 1723 1.68
1B 435 (<1% slower than A) 1645 1.61 (96% of A)
1C 149 5073 14.6
1D 174 (17% faster than C) 5232 12.8 (88% of C)
Restoring from the WAL archives produced during the first test.
Recv sec s/trans
2A 61 0.0581
2B 62 0.0594 (2% slower than A)
2C 287 0.805
2D 314 0.750 (7% faster than C)
For vanilla pgbench, WAL size shrinks slightly and performance
seems very slightly worse than unpatched postgres (1A vs. 1B). It
can safely be said that there is no harm to performance even outside of the
effective range of this patch. On the other hand, the performance
gain becomes 17% within the effective range (1C vs. 1D).
Recovery performance shows the same tendency: a very small loss
outside of the effective range (2A vs. 2B) and a significant gain
within it (2C vs. 2D).
As a whole, this patch brings a very large gain in its effective
range - e.g. updates of relatively small portions of tuples - but
only a negligible loss of performance is observed outside of its
effective range.
I'll mark this patch as 'Ready for Committer' as soon as I get
finished confirming the mod patch.
==========
I think the new naming I have done is more meaningful; do you think I should
revert to the previous patch's names?
New naming is more meaningful, and a bit long. I don't think it
should be reverted.
Looking into wal_update_changes_mod_lz_v6.patch, I understand
that this patch experimentally adds literal data segments which
can hold more than a single byte to the PG-LZ algorithm. According to
pglz_find_match, memcmp is slower than 'while(*s && *s == *d)' if
len < 16, and I suppose that is probably true at least for 4-byte
data. This also applies on the encoding side. If this mod
does no harm to performance, I want to see it applied also to
pglz_compress.
Where in pglz_compress() do you want to see similar usage?
Or do you want to see such use in the function
heap_attr_get_length_and_check_equals(), where it compares two attributes?
My point was the format for literal segments. It seems to reduce
the size of literal segments by about an eighth. But the effectiveness in a
real environment does not look promising. Forget it. It's just a
fancy idea.
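To put a rough number on "about an eighth" (my arithmetic, not a measured
figure): in the classic PG-LZ output every literal byte carries one control
bit, i.e. one control byte per 8 literal bytes, roughly 12.5% overhead on
literal data. With the bulk literal format a run of up to 255 literal bytes
costs one control bit plus one length byte, so a 255-byte literal run
shrinks from about 255 + 32 = 287 bytes to about 255 + 1 = 256 bytes; the
saving approaches an eighth of the literal data for long runs and fades
for very short ones.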
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Friday, December 14, 2012 2:32 PM Kyotaro HORIGUCHI wrote:
Hello, I took the performance figures for this patch.
CentOS6.3/Core i7
wal_level = archive, checkpoint_segments = 30 / 5min
A. Vanilla pgbench, postgres is HEAD
B. Vanilla pgbench, postgres is with this patch
(wal_update_changes_lz_v5)
C. Modified pgbench(Long text), postgres is HEAD
D. Modified pgbench(Long text), postgres is with this patch
Run by doing pgbench -s 10 -i, then pgbench -c 20 -T 2400
#trans/s WAL MB WAL kB/tran
1A 437 1723 1.68
1B 435 (<1% slower than A) 1645 1.61 (96% of A)
1C 149 5073 14.6
1D 174 (17% faster than C) 5232 12.8 (88% of C)
Restoring from the WAL archives produced during the first test.
Recv sec s/trans
2A 61 0.0581
2B 62 0.0594 (2% slower than A)
2C 287 0.805
2D 314 0.750 (7% faster than C)
For vanilla pgbench, WAL size shrinks slightly and performance
seems very slightly worse than unpatched postgres (1A vs. 1B). It
can safely be said that there is no harm to performance even outside of the
effective range of this patch. On the other hand, the performance
gain becomes 17% within the effective range (1C vs. 1D).
Recovery performance shows the same tendency: a very small loss
outside of the effective range (2A vs. 2B) and a significant gain
within it (2C vs. 2D).
As a whole, this patch brings a very large gain in its effective
range - e.g. updates of relatively small portions of tuples - but
only a negligible loss of performance is observed outside of its
effective range.
I'll mark this patch as 'Ready for Committer' as soon as I get
finished confirming the mod patch.
Thank you very much.
With Regards,
Amit Kapila.
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Applies cleanly onto HEAD.
- Some mis-codings have been fixed.
And I took performance figures for 4 types of modification
versus 2 benchmarks.
I've seen a small performance gain (4-8% for execution, and 6-12% for
recovery), and a 16% WAL shrink for the modified pgbench enhances the
benefit of this patch.
On the other hand, I've found no significant loss of performance
for execution and a 4% reduction of WAL for the original pgbench, but
there might be a 4-8% performance loss for recovery.
The attached patches are listed below.
wal_update_changes_lz_v5.patch
A rather straightforward implementation of WAL compression using the
existing pg_lz compression format.
wal_update_changes_mod_lz_v6_2.patch
Modifies pg_lz to have a bulk literal segment format which is
available only for WAL compression. A misplaced comment is fixed.
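To make the bulk literal segment idea concrete, here is a simplified sketch
contrasting the classic one-control-flag-per-literal-byte emission with a
bulk run of literals behind a single length header; the one-byte length
prefix and the function names are assumptions for illustration only and are
not the exact format used by the attached patch.

#include <string.h>

/*
 * Simplified illustration only; real PGLZ packs control bits eight to a
 * control byte, which is omitted here for clarity.
 */

/* Classic style: one control flag per literal byte. */
static char *
emit_literals_per_byte(char *out, const char *src, int len)
{
    while (len-- > 0)
    {
        *out++ = 0;         /* control flag: a literal byte follows */
        *out++ = *src++;    /* the literal byte itself */
    }
    return out;
}

/* Bulk style: a single length header followed by the raw bytes. */
static char *
emit_literals_bulk(char *out, const char *src, int len)
{
    *out++ = (char) len;    /* assumed one-byte length prefix, len <= 255 */
    memcpy(out, src, len);
    return out + len;
}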
The details of the performance tests follow.
=====
I've tested the mod patch and a 'modified' version of the mod
patch as well.
CentOS6.3/Core i7
wal_level = archive, checkpoint_segments = 30 / 5min
wal_update_changes_mod_lz_v6+ is the version in which segments
shorter than 16 bytes are copied with 'while(*s) *d++=*s++' instead
of memcpy (a sketch of this idea follows below).
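A minimal sketch of that copy strategy follows, assuming a 16-byte cutoff
and length-bounded segments; the function name copy_segment is invented for
this example, and the real change operates inside the PGLZ encode/decode
loops rather than as a standalone helper.

#include <string.h>

/*
 * Illustrative sketch: copy a segment, using a simple byte loop for
 * short segments and memcpy() for longer ones.
 */
static void
copy_segment(char *d, const char *s, int len)
{
    if (len < 16)
    {
        while (len-- > 0)
            *d++ = *s++;
    }
    else
        memcpy(d, s, len);
}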
postgres pgbench
A. HEAD Original
B. wal_update_changes_lz_v5 Original
C. wal_update_changes_mod_lz_v6 Original
D. wal_update_changes_mod_lz_v6+ Original
E. HEAD attached with this patch
F. wal_update_changes_lz_v5 attached with this patch
G. wal_update_changes_mod_lz_v6 attached with this patch
H. wal_update_changes_mod_lz_v6+ attached with this patch
Run with pgbench -c 10 -j 10 -T 1200 after initializing with pgbench -s 10 -i
#trans/s WAL MB WAL kB/tran
1A 346 760 1.87
1B 347 730 1.80 (96% of A)
1C 346 729 1.80 (96% of A)
1D 347 730 1.80 (96% of A)
1E 192 2790 6.20
1F 200 (4% faster than E) 2431 5.19 (84% of E)
1G 207 (8% faster than E) 2563 5.28 (85% of E)
1H 199 (4% faster than E) 2421 5.19 (84% of E)
Recovery time
Recv sec us/trans
2A 26 62.6
2B 27 64.8 (4% slower than A)
2C 28 67.4 (8% slower than A)
2D 26 62.4 (same as A)
2E 130 629
2F 149 579 ( 8% faster than E)
2G 128 592 ( 6% faster than E)
2H 130 553 (12% faster than E)
For vanilla pgbench, WAL size shrinks slightly and performance
seems the same as unpatched postgres (1A vs. 1B, 1C, 1D). For modified
pgbench, WAL size shrinks by about 17% and performance shows
a gain of several percent.
Recovery performance shows the same tendency: a very small loss
outside of the effective range (2A vs. 2B, 2C) and a significant
gain within it (2E vs. 2F, 2G, 2H).
As a whole, this patch brings a very large gain in its effective
range - e.g. updates that modify relatively small portions of a tuple -
while only a negligible performance loss is observed outside of its
effective range on the test machine. I suppose the losses would be
more pronounced on WAL devices with higher sequential write
performance.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
wal_update_changes_lz_v5.patch (text/x-patch; charset=us-ascii)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 298,310 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 322,331 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 384,392 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 513,534 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 637,1015 ----
}
/*
+ * Check if the specified attribute's value is the same in both given tuples,
+ * and output the length of the given attribute in both tuples.
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ * Forms an encoded data from old and new tuple with the modified columns
+ * using an algorithm similar to LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the encoded data using lz algorithm.
+ *
+ * Encode the bitmap [+padding] [+oid] as a new data. And loop for all
+ * attributes to find any modifications in the attributes.
+ *
+ * The unmodified data is encoded as a history tag to the output and the
+ * modified data is encoded as new data to the output.
+ *
+ * If the encoded output data is less than 75% of the original data,
+ * the output is treated as successfully encoded and we proceed further.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum encoded data is of 75% of total size. The max tuple size is
+ * already validated as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer has reached result_max, advancing the
+ * buffer by the calculated approximate length for the corresponding
+ * operation.
+ */
+ if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ /*
+ * Check whether the output buffer has reached result_max by advancing
+ * the buffer by the calculated approximate length for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap len needs to be added to match_off to get
+ * the actual start offset from the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in both the old and new tuples then
+ * encode it as data to be copied from the history tuple with len
+ * and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ /*
+ * Check whether the output buffer has reached result_max by advancing
+ * the buffer by the calculated approximate length for the
+ * corresponding operation.
+ */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the old tuple field start position; this is required to
+ * skip any alignment padding that is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+ /*
+ * Calculate the new tuple field start position to check
+ * whether any padding is required or not because of field
+ * alignment.
+ */
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the last attribute value of the new
+ * tuple is the same as in the old tuple, then encode the
+ * data up to the current match as history data.
+ *
+ * If the last attribute value of the new tuple is not the
+ * same as in the old tuple, the marking of matched data as
+ * history has already been taken care of.
+ */
+ if (is_attr_equals)
+ {
+ /*
+ * Check whether the output buffer has reached result_max
+ * by advancing the buffer by the calculated
+ * approximate length for the corresponding operation.
+ */
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data is present then copy it. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data is present then copy it as history */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
+ /*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 844,849 **** heapgettup_pagemode(HeapScanDesc scan,
--- 845,898 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 860,866 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 909,916 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
***************
*** 2383,2389 **** simple_heap_insert(Relation relation, HeapTuple tup)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData *hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
--- 2433,2439 ----
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
! HeapUpdateFailureData * hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
***************
*** 3212,3221 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3262,3273 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3282,3355 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3334,3344 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4447,4453 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4436,4442 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4456,4461 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4445,4461 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4465,4475 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4465,4505 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can hold only history offsets in the range 1 - 4095,
+ * so delta encoding is not attempted for tuples whose length is more
+ * than PGLZ_HISTORY_SIZE.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4496,4504 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4526,4537 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5274,5280 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5307,5316 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5289,5295 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5325,5331 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5349,5355 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5385,5391 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5368,5374 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5404,5410 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5393,5399 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5429,5435 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5456,5465 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5492,5520 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /* PG93FORMAT: LZ header + Encoded data */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5474,5480 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5529,5535 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,714 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
--- 658,685 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history to
! * OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! char flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,30 ----
int32 rawsize;
} PGLZ_Header;
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 88,198 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * Splits the backward/history reference into separate chunks if the
+ * given length is more than the max match, and repeats the process
+ * until the whole length has been processed.
+ *
+ * If the matched history length is less than 3 bytes then it is added
+ * as new data during encoding instead of as a history reference. This
+ * occurs only while framing the delta record for a WAL update operation.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ if (_mlen < 3) \
+ { \
+ (_byte) = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mlen,(_byte)); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs literal bytes to the destination buffer, including the
+ * appropriate control bits, until the given input length is consumed.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _total_len = (_len); \
+ while (_total_len-- > 0) \
+ { \
+ pglz_out_literal(_ctrlp, _ctrlb, _ctrl, _buf, *(_byte)); \
+ (_byte) = (char *)(_byte) + 1; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 215,220 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
wal_update_changes_mod_lz_v6_2.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index 034dfe5..83bd03d 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,6 +60,7 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
@@ -297,12 +298,13 @@ heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
- * nocachegetattr
+ * nocachegetattr_with_len
*
- * This only gets called from fastgetattr() macro, in cases where
+ * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
- * This caches attribute offsets in the attribute descriptor.
+ * This caches attribute offsets in the attribute descriptor and
+ * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
@@ -320,9 +322,10 @@ heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
-nocachegetattr(HeapTuple tuple,
- int attnum,
- TupleDesc tupleDesc)
+nocachegetattr_with_len(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc,
+ Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
@@ -381,6 +384,9 @@ nocachegetattr(HeapTuple tuple,
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
@@ -507,9 +513,22 @@ nocachegetattr(HeapTuple tuple,
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+/*
+ * nocachegetattr
+ */
+Datum
+nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+{
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+}
+
/* ----------------
* heap_getsysattr
*
@@ -618,6 +637,379 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
}
/*
+ * Check if the specified attribute's value is the same in both given tuples,
+ * and output the length of the given attribute in both tuples.
+ */
+bool
+heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+{
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+}
+
+/* ----------------
+ * heap_delta_encode
+ * Forms an encoded data from old and new tuple with the modified columns
+ * using an algorithm similar to LZ algorithm.
+ *
+ * tupleDesc - Tuple descriptor.
+ * oldtup - pointer to the old/history tuple.
+ * newtup - pointer to the new tuple.
+ * encdata - pointer to the encoded data using lz algorithm.
+ *
+ * Encode the bitmap [+padding] [+oid] as a new data. And loop for all
+ * attributes to find any modifications in the attributes.
+ *
+ * The unmodified data is encoded as a history tag to the output and the
+ * modified data is encoded as new data to the output.
+ *
+ * If the encoded output data is less than 75% of the original data,
+ * the output is treated as successfully encoded and we proceed further.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+{
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ /* Include the bitmap header in the lz encoded data. */
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * The maximum encoded data is of 75% of total size. The max tuple size is
+ * already validated as it cannot be more than MaxHeapTupleSize.
+ */
+ result_max = (new_tup_len * 75) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer has reached result_max, advancing the
+ * buffer by the calculated approximate length for the corresponding
+ * operation.
+ */
+ if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ /*
+ * Check whether the output buffer has reached result_max by advancing
+ * the buffer by the calculated approximate length for the
+ * corresponding operation.
+ */
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap len needs to be added to match_off to get
+ * the actual start offset from the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in both the old and new tuples then
+ * encode it as data to be copied from the history tuple with len
+ * and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding
+ * present in the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ /*
+ * Check for output buffer is reached the result_max by advancing
+ * the buffer by the calculated aproximate length for the
+ * corresponding operation.
+ */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the old tuple field start position; this is required to
+ * skip any alignment padding that is present.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+ /*
+ * Calculate the new tuple field start position to check
+ * whether any padding is required or not because of field
+ * alignment.
+ */
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the last attribute value of the new
+ * tuple is the same as in the old tuple, then encode the
+ * data up to the current match as history data.
+ *
+ * If the last attribute value of the new tuple is not the
+ * same as in the old tuple, the matched data has already
+ * been marked as history.
+ */
+ if (is_attr_equals)
+ {
+ /*
+ * Check whether the output buffer would exceed result_max
+ * after advancing it by the approximate length required
+ * for the corresponding operation.
+ */
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data is present, copy it. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any remaining old tuple data is present, copy it as history */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+}
+
+/* ----------------
+ * heap_delta_decode
+ * Decodes the encoded data to dest tuple with the help of history.
+ *
+ * encdata - Pointer to the encoded data.
+ * oldtup - pointer to the history tuple.
+ * newtup - pointer to the destination tuple.
+ * ----------------
+ */
+void
+heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+}
+
+/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
* which are of the length indicated by tupleDescriptor->natts
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 186fb87..46a0d26 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -85,6 +85,7 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
@@ -857,6 +858,54 @@ heapgettup_pagemode(HeapScanDesc scan,
* definition in access/htup.h is maintained.
*/
Datum
+fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+{
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len)))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+}
+
+/*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
@@ -873,7 +922,8 @@ fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
- nocachegetattr((tup), (attnum), (tupleDesc))
+ (
+ nocachegetattr((tup), (attnum), (tupleDesc)))
)
:
(
@@ -2400,7 +2450,7 @@ simple_heap_insert(Relation relation, HeapTuple tup)
HTSU_Result
heap_delete(Relation relation, ItemPointer tid,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData * hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -2702,7 +2752,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
result = heap_delete(relation, tid,
GetCurrentCommandId(true), InvalidSnapshot,
- true /* wait for commit */,
+ true /* wait for commit */ ,
&hufd);
switch (result)
{
@@ -2759,7 +2809,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
HTSU_Result
heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
CommandId cid, Snapshot crosscheck, bool wait,
- HeapUpdateFailureData *hufd)
+ HeapUpdateFailureData * hufd)
{
HTSU_Result result;
TransactionId xid = GetCurrentTransactionId();
@@ -3229,10 +3279,12 @@ l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
- XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
- newbuf, heaptup,
- all_visible_cleared,
- all_visible_cleared_new);
+ XLogRecPtr recptr;
+
+ recptr = log_heap_update(relation, buffer, oldtup.t_self,
+ newbuf, heaptup, &oldtup,
+ all_visible_cleared,
+ all_visible_cleared_new);
if (newbuf != buffer)
{
@@ -3299,74 +3351,11 @@ static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
- Datum value1,
- value2;
- bool isnull1,
- isnull2;
- Form_pg_attribute att;
+ Size tup1_attr_len,
+ tup2_attr_len;
- /*
- * If it's a whole-tuple reference, say "not equal". It's not really
- * worth supporting this case, since it could only succeed after a no-op
- * update, which is hardly a case worth optimizing for.
- */
- if (attrnum == 0)
- return false;
-
- /*
- * Likewise, automatically say "not equal" for any system attribute other
- * than OID and tableOID; we cannot expect these to be consistent in a HOT
- * chain, or even to be set correctly yet in the new tuple.
- */
- if (attrnum < 0)
- {
- if (attrnum != ObjectIdAttributeNumber &&
- attrnum != TableOidAttributeNumber)
- return false;
- }
-
- /*
- * Extract the corresponding values. XXX this is pretty inefficient if
- * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
- * single heap_deform_tuple call on each tuple, instead? But that doesn't
- * work for system columns ...
- */
- value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
- value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
-
- /*
- * If one value is NULL and other is not, then they are certainly not
- * equal
- */
- if (isnull1 != isnull2)
- return false;
-
- /*
- * If both are NULL, they can be considered equal.
- */
- if (isnull1)
- return true;
-
- /*
- * We do simple binary comparison of the two datums. This may be overly
- * strict because there can be multiple binary representations for the
- * same logical value. But we should be OK as long as there are no false
- * positives. Using a type-specific equality operator is messy because
- * there could be multiple notions of equality in different operator
- * classes; furthermore, we cannot safely invoke user-defined functions
- * while holding exclusive buffer lock.
- */
- if (attrnum <= 0)
- {
- /* The only allowed system columns are OIDs, so do this */
- return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
- }
- else
- {
- Assert(attrnum <= tupdesc->natts);
- att = tupdesc->attrs[attrnum - 1];
- return datumIsEqual(value1, value2, att->attbyval, att->attlen);
- }
+ return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
+ &tup1_attr_len, &tup2_attr_len);
}
/*
@@ -3417,7 +3406,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
result = heap_update(relation, otid, tup,
GetCurrentCommandId(true), InvalidSnapshot,
- true /* wait for commit */,
+ true /* wait for commit */ ,
&hufd);
switch (result)
{
@@ -3504,7 +3493,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup)
HTSU_Result
heap_lock_tuple(Relation relation, HeapTuple tuple,
CommandId cid, LockTupleMode mode, bool nowait,
- Buffer *buffer, HeapUpdateFailureData *hufd)
+ Buffer *buffer, HeapUpdateFailureData * hufd)
{
HTSU_Result result;
ItemPointer tid = &(tuple->t_self);
@@ -4464,7 +4453,7 @@ log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
- Buffer newbuf, HeapTuple newtup,
+ Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
@@ -4473,6 +4462,17 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ int oldtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -4482,11 +4482,41 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* Is the update going to the same page? */
+ if (oldbuf == newbuf)
+ {
+ /*
+ * The LZ algorithm can hold history offsets only in the range 1 - 4095,
+ * so delta encoding is restricted to tuples shorter than
+ * PGLZ_HISTORY_SIZE.
+ */
+ if (oldtuplen < PGLZ_HISTORY_SIZE)
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup,
+ &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -4513,9 +4543,12 @@ log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows,
+ * OR PG93FORMAT (if encoded): LZ header + encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -5291,7 +5324,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -5306,7 +5342,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -5366,7 +5402,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
@@ -5385,7 +5421,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -5410,7 +5446,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -5473,10 +5509,29 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the new tuple was delta-encoded, decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /* PG93FORMAT: LZ header + Encoded data */
+ PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
+
+ oldtup.t_data = oldtupdata;
+ newtup.t_data = htup;
+
+ heap_delta_decode(encoded_data, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -5491,7 +5546,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 466982e..d836b51 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -182,9 +182,6 @@
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
-#define PGLZ_HISTORY_SIZE 4096
-#define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
@@ -302,67 +299,6 @@ do { \
} \
} while (0)
-
-/* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
-#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
-do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
-} while (0)
-
-
-/* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
-} while (0)
-
-
-/* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
-#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
-do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
-} while (0)
-
-
/* ----------
* pglz_find_match -
*
@@ -595,7 +531,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
- pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
@@ -647,15 +583,38 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+}
+
+/* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+{
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
- srcend = ((const unsigned char *) source) + VARSIZE(source);
+ srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
- destend = dp + source->rawsize;
+ destend = dp + src.rawsize;
+
+ if (destlen)
+ {
+ *destlen = src.rawsize;
+ }
while (sp < srcend && dp < destend)
{
@@ -699,28 +658,76 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
- /*
- * Now we copy the bytes specified by the tag from OUTPUT to
- * OUTPUT. It is dangerous and platform dependent to use
- * memcpy() here, because the copied areas could overlap
- * extremely!
- */
- while (len--)
+ if (history)
+ {
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, history + off, len);
+ dp += len;
+ }
+ else
{
- *dp = dp[-off];
- dp++;
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT to
+ * OUTPUT. It is dangerous and platform dependent to use
+ * memcpy() here, because the copied areas could overlap
+ * extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
}
}
else
{
- /*
- * An unset control bit means LITERAL BYTE. So we just copy
- * one from INPUT to OUTPUT.
- */
- if (dp >= destend) /* check for buffer overrun */
- break; /* do not clobber memory */
-
- *dp++ = *sp++;
+ if (history)
+ {
+ /*
+ * The byte at current offset in the source is the length
+ * of this literal segment. See pglz_out_add for encoding
+ * side.
+ */
+ int32 len;
+
+ len = sp[0];
+ sp += 1;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from Source to
+ * OUTPUT.
+ */
+ memcpy(dp, sp, len);
+ dp += len;
+ sp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
}
/*
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 8ec710e..3e4001f 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -142,12 +142,20 @@ typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ char flags; /* flag bits, see below */
+
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ page's all-visible bit was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ page's all-visible bit was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
+ operation is delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 7abe3e6..4419fc4 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -18,6 +18,7 @@
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+#include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
@@ -528,6 +529,7 @@ struct MinimalTupleData
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+#if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
@@ -542,9 +544,6 @@ struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
-#if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
@@ -572,14 +571,56 @@ struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+) \
+
+/* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+#define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
-#else /* defined(DISABLE_COMPLEX_MACRO) */
+#else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
@@ -596,21 +637,43 @@ extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+( \
+ ((attnum) > 0) ? \
( \
- ((attnum) > 0) ? \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
- ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
- ( \
- (*(isnull) = true), \
- (Datum)NULL \
- ) \
- : \
- fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
+ (*(isnull) = true), \
+ (Datum)NULL \
) \
: \
- heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
- )
+ fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+)
+/* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+#define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+)
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
@@ -620,6 +683,8 @@ extern void heap_fill_tuple(TupleDesc tupleDesc,
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
@@ -636,6 +701,14 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/tupmacs.h b/src/include/access/tupmacs.h
index 984a049..c1a27f7 100644
--- a/src/include/access/tupmacs.h
+++ b/src/include/access/tupmacs.h
@@ -187,6 +187,28 @@
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+#define att_getlength(attlen, attptr) \
+( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+)
+
+
+/*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..7b9d588 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -23,6 +23,8 @@ typedef struct PGLZ_Header
int32 rawsize;
} PGLZ_Header;
+#define PGLZ_HISTORY_SIZE 4096
+#define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
@@ -86,6 +88,119 @@ typedef struct PGLZ_Strategy
int32 match_size_drop;
} PGLZ_Strategy;
+/*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+#define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+)
+
+/* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+#define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+} while (0)
+
+/* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * The backward/history reference is split into multiple chunks if the
+ * given length is more than the maximum match length, repeating the
+ * process until the whole length is processed.
+ *
+ * If the matched history length is less than 3 bytes, it is added as
+ * new data during encoding instead of as a history reference. This
+ * occurs only while framing the delta record for a WAL update operation.
+ * ----------
+ */
+#define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+do { \
+ int _mlen; \
+ int _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _total_len; \
+ if (_mlen < 3) \
+ { \
+ (_byte) = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mlen,(_byte)); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mlen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mlen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mlen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _total_len -= _mlen; \
+ (_off) += _mlen; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+#define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+do { \
+ int32 _mlen; \
+ int32 _total_len = (_len); \
+ while (_total_len > 0) \
+ { \
+ _mlen = _total_len > 255 ? 255 : _total_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_mlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _mlen); \
+ (_buf) += _mlen; \
+ (_byte) += _mlen; \
+ _total_len -= _mlen; \
+ } \
+} while (0)
+
/* ----------
* The standard strategies
@@ -108,5 +223,6 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
-
+extern void pglz_decompress_with_history(const char *source, char *dest,
+ uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.
I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.
My comments
* There is a fixed 75% heuristic in the patch. Can we document where
that came from? Can we have a parameter that sets that please? This
can be used to have further tests to confirm the useful setting of
this. I expect it to be removed before we release, but it will help
during beta.
* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way. Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?
* If full page writes is on and the page is very old, we are just
going to copy the whole block. So why not check for that rather than
do all these push ups and then just copy the page anyway?
* TOAST is not handled at all. No comments about it, nothing. Does
that mean it hasn't been considered? Or did we decide not to care in
this release? Presumably that means we are comparing toast pointers
byte by byte to see if they are the same?
* I'd like to see a specific test in regression that is designed to
exercise the code here. That way we will be certain that the code is
getting regularly tested.
* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.
* Lots of typos in comments. Many comments say nothing more than the
words already used in the function name itself
* "flags" variables are almost always int or uint in PG source.
* PGLZ_HISTORY_SIZE needs to be documented in the place it is defined,
not the place it's used. The test if (oldtuplen < PGLZ_HISTORY_SIZE)
really needs to be a test inside the compression module to maintain
better modularity, so the value itself needn't be exported
* Need to mention the WAL format change, or include the change within
the patch so we can see
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:
On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Applicable onto HEAD.
- Some mis-codings are fixed.
I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.
My comments
* There is a fixed 75% heuristic in the patch. Can we document where
that came from?
It is from LZ compression strategy. Refer PGLZ_Strategy.
I will add comment for it.
Can we have a parameter that sets that please? This
can be used to have further tests to confirm the useful setting of
this. I expect it to be removed before we release, but it will help
during beta.
I shall add that for test purpose.
* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.
I think it makes more sense. So I shall update the patch.
Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?
I shall evaluate and discuss with you.
* If full page writes is on and the page is very old, we are just
going to copy the whole block. So why not check for that rather than
do all these push ups and then just copy the page anyway?
I shall check once and update the patch.
* TOAST is not handled at all. No comments about it, nothing. Does
that mean it hasn't been considered? Or did we decide not to care in
this release?
Presumably that means we are comparing toast pointers
byte by byte to see if they are the same?
Yes, currently this patch is doing byte by byte comparison for toast
pointers. I shall add comment.
In future, we can evaluate if further optimizations can be done.
* I'd like to see a specific test in regression that is designed to
exercise the code here. That way we will be certain that the code is
getting regularly tested.
I shall add more specific tests.
* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.
Do you have any suggestion for where to put this information, any particular
ReadMe?
* Lots of typos in comments. Many comments say nothing more than the
words already used in the function name itself
* "flags" variables are almost always int or uint in PG source.
* PGLZ_HISTORY_SIZE needs to be documented in the place it is defined,
not the place it's used. The test if (oldtuplen < PGLZ_HISTORY_SIZE)
really needs to be a test inside the compression module to maintain
better modularity, so the value itself needn't be exported
I shall update the patch to address it.
* Need to mention the WAL format change, or include the change within
the patch so we can see
Sure, I will update this in code comments and internals docs.
With Regards,
Amit Kapila.
On Friday, December 28, 2012 1:38 PM Kyotaro HORIGUCHI wrote:
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Applicable onto HEAD.
- Some mis-codings are fixed.
And took the performance figures for 4 types of modification versus 2
benchmarks.
As a whole, this patch brings very large gain in its effective range -
e.g. updates of relatively small portions in a tuple, but negligible
loss of performance is observed outside of its effective range on the
test machine. I suppose the losses will be emphasized by the higher
performance of seq write of WAL devices.
Thank you very much for the review.
With Regards,
Amit Kapila.
On 28 December 2012 11:27, Amit Kapila <amit.kapila@huawei.com> wrote:
* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.
Do you have any suggestion for where to put this information, any particular
ReadMe?
Location is less relevant, since it will show up as additions in the patch.
Put it wherever makes most sense in comparison to existing related
comments/README. I have no preference myself.
If it's any consolation, I notice a common issue with patches is lack
of *explanatory* comments, as opposed to line by line comments. So the
same review comment applies to 50-75% of patches I've reviewed recently, which
is also likely why.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 December 2012 11:27, Amit Kapila <amit.kapila@huawei.com> wrote:
* TOAST is not handled at all. No comments about it, nothing. Does
that mean it hasn't been considered? Or did we decide not to care in
this release?
Presumably that means we are comparing toast pointers
byte by byte to see if they are the same?
Yes, currently this patch is doing byte by byte comparison for toast
pointers. I shall add comment.
In future, we can evaluate if further optimizations can be done.
Just a comment to say that the comparison takes place after TOASTed
columns have been removed. TOAST is already optimised for whole value
UPDATE anyway, so that is the right place to produce the delta.
It does make me think that we can further optimise TOAST by updating
only the parts of a toasted datum that have changed. That will be
useful for JSON and XML applications that change only a portion of
large documents.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:
On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Applicable onto HEAD.
- Some mis-codings are fixed.
I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.
* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.
Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?
If we have to do whole tuple comparison, then storing of changed parts might
need to be done in a byte-by-byte way rather than at column offset boundaries.
This might not be possible with the current algorithm, as it stores the
information in WAL column-by-column and decodes it in a similar way.
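For illustration only, here is a minimal sketch (not part of either posted patch; the helper name is hypothetical) of what such a byte-by-byte comparison could look like, locating the changed region by common prefix and suffix instead of walking column boundaries:

/*
 * Hypothetical sketch: locate the changed region of the tuple data by
 * common prefix/suffix, ignoring column boundaries.  Shown only to
 * illustrate the byte-by-byte alternative discussed above.
 */
static void
find_changed_region(const char *old_data, int old_len,
					const char *new_data, int new_len,
					int *prefix_len, int *suffix_len)
{
	int			prefix = 0;
	int			suffix = 0;
	int			minlen = (old_len < new_len) ? old_len : new_len;

	/* longest common prefix */
	while (prefix < minlen && old_data[prefix] == new_data[prefix])
		prefix++;

	/* longest common suffix, not overlapping the prefix */
	while (suffix < minlen - prefix &&
		   old_data[old_len - 1 - suffix] == new_data[new_len - 1 - suffix])
		suffix++;

	*prefix_len = prefix;
	*suffix_len = suffix;
}

The WAL record would then only need the prefix/suffix lengths plus the bytes in between from the new tuple, but, as noted above, that no longer lines up with the column-by-column encode/decode used in the current patches.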
The internal docs are completely absent. We need at least a whole page of
descriptive comment, discussing trade-offs and design decisions.
Currently I have planned to put it in transam/README, as most of the WAL
description is present there.
With Regards,
Amit Kapila.
On 4 January 2013 13:53, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:
On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Applicable onto HEAD.
- Some mis-codings are fixed.
I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.
* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.
Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?
If we have to do whole tuple comparison, then storing of changed parts might
need to be done in a byte-by-byte way rather than at column offset boundaries.
This might not be possible with the current algorithm, as it stores the
information in WAL column-by-column and decodes it in a similar way.
OK, please explain in comments.
The internal docs are completely absent. We need at least a whole page of
descriptive comment, discussing trade-offs and design decisions.
Currently I have planned to put it in transam/README, as most of the WAL
description is present there.
But also in comments for each major function.
Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 04, 2013 8:03 PM Simon Riggs wrote:
On 4 January 2013 13:53, Amit Kapila <amit.kapila@huawei.com> wrote:
On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:
On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, I saw this patch and confirmed that
- Coding style looks good.
- Applicable onto HEAD.
- Some mis-codings are fixed.
I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.
The updated patch contains handling of the below comments:
* There is a fixed 75% heuristic in the patch. Can we document where
that came from? Can we have a parameter that sets that please? This
can be used to have further tests to confirm the useful setting of
this. I expect it to be removed before we release, but it will help
during beta.
Added a guc variable wal_update_compression_ratio to set the compression ratio.
It can be removed during beta.
* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.
Added a check in heap_delta_encode to identify whether the tuples differ in length by more than 50%.
* If full page writes is on and the page is very old, we are just
going to copy the whole block. So why not check for that rather than
do all these push ups and then just copy the page anyway?
Added a function which is used to identify whether the page needs a backup block or not;
based on the result, the optimization is applied. A rough sketch of such a check is shown below.
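For reference, a minimal sketch of how such a check could look, under the assumption that it follows the usual full-page-write rule and runs on a tree where XLogRecPtr is a plain integer; the function name and placement are hypothetical, not necessarily what the patch uses:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"

/*
 * Hypothetical sketch, not the patch's actual code: if the page has not
 * been modified since the last checkpoint's redo pointer, the next WAL
 * record touching it will carry a full-page image anyway, so spending
 * cycles on delta encoding the tuple buys nothing.
 */
static bool
page_needs_backup_block(Page page, XLogRecPtr redo_ptr)
{
	return PageGetLSN(page) <= redo_ptr;
}

log_heap_update could consult such a check before attempting heap_delta_encode, falling back to logging the full new tuple when a backup block is unavoidable.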
* I'd like to see a specific test in regression that is designed to
exercise the code here. That way we will be certain that the code is
getting regularly tested.
Added the regression tests which cover all the changes done for the optimization except recovery.
* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.
* Need to mention the WAL format change, or include the change within
the patch so we can see
backend/access/transam/README is updated with details.
* Lots of typos in comments. Many comments say nothing more than the
words already used in the function name itself
corrected the typos and removed unnecessary comments.
* "flags" variables are almost always int or uint in PG source.
* PGLZ_HISTORY_SIZE needs to be documented in the place it is defined,
not the place it's used. The test if (oldtuplen < PGLZ_HISTORY_SIZE)
really needs to be a test inside the compression module to maintain
better modularity, so the value itself needn't be exported
The (oldtuplen < PGLZ_HISTORY_SIZE) validation is moved inside heap_delta_encode,
and the flags variable is updated as well.
Test results with modified pgbench (1800 record size) on the latest patch:
-Patch- -tps@-c1- -WAL@-c1- -tps@-c2- -WAL@-c2-
Head 831 4.17 GB 1416 7.13 GB
WAL modification 846 2.36 GB 1712 3.31 GB
-Patch- -tps@-c4- -WAL@-c4- -tps@-c8- -WAL@-c8-
Head 2196 11.01 GB 2758 13.88 GB
WAL modification 3295 5.87 GB 5472 9.02 GB
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v7.patchapplication/octet-stream; name=wal_update_changes_v7.patchDownload
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ /* guc variable for delta record compression ratio for wal update */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 300,312 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 324,333 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 386,394 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 515,536 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 638,1036 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_attr_get_length_and_check_equals
+ *
+ * Compares the specified attribute's value in both given tuples, outputs
+ * the length of the given attribute in both tuples and returns the result
+ * of the comparison.
+ * ----------------
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values. XXX this is pretty inefficient if
+ * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
+ * single heap_deform_tuple call on each tuple, instead? But that doesn't
+ * work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Forms delta encoded data from the old and new tuples with the modified
+ * columns, by comparing the column data and encoding it as follows.
+ *
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ *
+ * Calculate the maximum encoded output data length, which is 75%
+ * (the default compression rate) of the original data length.
+ *
+ * Copy the bitmap data from the new tuple to the encoded data buffer and
+ * loop over all attributes to find any modifications in the attributes.
+ * Unmodified data is encoded as a history tag in the output and
+ * modified data is encoded as new data in the output.
+ *
+ * History tag:
+ * If any column is modified, then the unmodified column data up to the
+ * modified column needs to be copied to the encoded data buffer as a
+ * history tag. The offset values are calculated with respect to the
+ * tuple t_hoff value, and the old and new tuple offsets are then
+ * recalculated based on padding in the tuples.
+ *
+ * Modified data:
+ * Copy the modified column data to the output buffer, if any, and
+ * calculate the next column start position in the old and new tuples,
+ * which is required to handle any alignment that is present. Once an
+ * alignment difference is found between the old and new tuples, verify
+ * whether the last attribute value of the new tuple is the same as in
+ * the old tuple; if so, encode the data up to the current match as
+ * history data, and also write the alignment difference as new data to
+ * the encode buffer.
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ old_tup_len,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ old_tup_len = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Only tuples of length less than PGLZ_HISTORY_SIZE are allowed for
+ * delta encoding
+ */
+ if (old_tup_len >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ /* Include the bitmap header in the lz encoded data. */
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /* If old and new vary in length by more than 50%, include new as-is */
+ if ((new_tup_len <= (old_tup_len >> 1))
+ || (old_tup_len <= (new_tup_len >> 1)))
+ return false;
+
+ /* Required compression ratio of the encoded data */
+ result_max = (new_tup_len * (100 - wal_update_compression_ratio)) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Check whether the output buffer would exceed result_max after
+ * advancing it by the approximate length required for the
+ * corresponding operation.
+ */
+ if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to the encoded data buffer */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ /*
+ * If the attribute is modified by the update operation, store the
+ * appropriate offsets in the WAL record, otherwise skip to the next
+ * attribute.
+ */
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple's t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in both the old and new tuples,
+ * encode it as a copy from the history tuple with the corresponding
+ * length and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding in
+ * the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ /* Copy the modified column data to the output buffer if present */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the old tuple field start position, which is needed to
+ * skip any alignment padding.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+ /*
+ * Calculate the new tuple field start position to check
+ * whether any padding is required because of field
+ * alignment.
+ */
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ /*
+ * Check whether there is any alignment difference between the
+ * old and new tuple attributes.
+ */
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the last attribute value of the new
+ * tuple is the same as in the old tuple, write a history tag
+ * up to the current match.
+ */
+ if (is_attr_equals)
+ {
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 * new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data remains, copy it. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 * data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /* If any leftover old tuple data remains, copy it as history */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /* Fill in the actual length of the compressed datum */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Frames the original tuple that needs to be inserted into the heap by
+ * decoding the WAL tuple with the help of the old heap tuple.
+ *
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ *
+ * History reference:
+ * The next 2 - 3 byte tag provides the offset and length of history match.
+ * From the offset with the corresponding length the old tuple data is
+ * copied to the new tuple.
+ *
+ * New data reference:
+ * First byte contains the length [0-255] of the modified data, followed
+ * by the modified data of corresponding length specified in the first byte.
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 857,862 **** heapgettup_pagemode(HeapScanDesc scan,
--- 858,911 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 873,879 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 922,929 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
***************
*** 3229,3238 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3279,3290 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3299,3372 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3351,3361 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4464,4470 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4453,4459 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4473,4478 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4462,4477 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds max output possible from the LZ algorithm */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4482,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4481,4514 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Only delta-encode when the update stays on the same page and the
+ * buffer does not need a backup block (when full_page_writes is on).
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4513,4521 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4535,4546 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows,
! * OR PG93FORMAT [if encoded]: LZ header + encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5291,5297 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5316,5325 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5306,5312 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5334,5340 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5366,5372 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5394,5400 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5385,5391 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5413,5419 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5410,5416 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5438,5444 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5473,5482 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5501,5532 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the new tuple was delta-encoded, decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3) bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5491,5497 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5541,5547 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,786 ----
entry until we've successfully done the original action.
+ Write-Ahead Logging for Update operation
+ ----------------------------------------
+
+ Delta WAL tuples for the update operation eliminate copying the entire new record
+ to WAL.
+
+
+ Delta tuple format
+ ------------------
+
+ Header + Control byte + history reference (2 - 3) bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is the same PGLZ_Header, which stores the compressed length and the raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this the control byte.
+
+
+ history reference:
+
+ A set bit in the control byte means, that a tag of 2-3 bytes follows. A tag contains information
+ to copy some bytes, that are already in the old tuple, to the current location in the output.
+ Let's call the three tag bytes T1, T2 and T3. The position of the data to copy is coded as an offset
+ from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ Please refer to the pg_lzcompress.c header for more details of the history reference.
+
+
+ New data:
+
+ An unset bit in the control byte means that new data follows: a one byte
+ length, then that many bytes copied from the new tuple to the delta tuple.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+
+ heap_delta_encode
+ -----------------
+
+ Calculate the maximum encoded output data length, which is 75% [the default compression rate]
+ of the original data.
+
+ Copy the bitmap data from the new tuple to the encoded data buffer and loop over all attributes
+ to find any modified attributes. Unmodified data is encoded as a history tag
+ in the output and modified data is encoded as new data in the output.
+
+
+ History tag:
+
+ If any column is modified, the unmodified column data up to the modified column is
+ copied to the encoded data buffer as a history tag. The offset values are calculated with respect to
+ the tuple's t_hoff value, and the old and new tuple offsets are then recalculated based on padding in the tuples.
+
+
+ Modified data:
+
+ Copy the modified column data to the output buffer if present and calculate the start
+ position of the next column in the old and new tuples, which is needed to handle any alignment padding.
+
+ Once an alignment difference is found between the old and new tuples, and the
+ last attribute value of the new tuple is the same as in the old tuple, encode everything up
+ to the current match as history data and also write the alignment difference as
+ new data to the encode buffer.
+
+
+ heap_delta_decode
+ -----------------
+
+ Frames the original tuple that needs to be inserted into the heap by decoding the WAL tuple
+ with the help of the old heap tuple. To frame the tuple, the following steps are carried out.
+
+ Read one control byte and process the next 8 items (or as many as remain in the compressed input).
+
+ History reference:
+
+ The next 2 - 3 byte tag provides the offset and length of history match.
+ From the offset with the corresponding length the old tuple data is copied to the new tuple.
+
+ Please refer to the pg_lzcompress.c header for more details of the history reference.
+
+ New data reference:
+
+ First byte contains the length [0-255] of the modified data, followed by the modified data of
+ corresponding length specified in the first byte.
+
+
+ Constraints
+ -----------
+
+ 1. Only HOT tuples whose buffers do not require a backup block for WAL
+ (when full_page_writes is on) are allowed for encoding.
+ 2. Only old tuples with length less than PGLZ_HISTORY_SIZE are allowed for encoding.
+ 3. Only old and new tuples that do not vary in length by more than 50% are allowed for encoding.
+
+
Asynchronous Commit
-------------------
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1206,1211 **** begin:;
--- 1206,1233 ----
}
/*
+ * Determine whether the referenced buffer has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but that will not cause any problem because this function is used
+ * only to decide whether a delta tuple is required for the WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,726 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * The byte at current offset in the source is the length
! * of this literal segment. See pglz_out_add for encoding
! * side.
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! int flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new page's
! all-visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,31 ----
int32 rawsize;
} PGLZ_Header;
+ /* The LZ algorithm can only hold history offsets in the range 1 - 4095. */
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 89,207 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history encode tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * If the given length is more than the maximum match, the backward/history
+ * reference is split into chunks and the process repeats until the whole
+ * length has been processed.
+ *
+ * If the matched history length is less than 3 bytes, it is added as
+ * new data during encoding instead of a history reference. This occurs
+ * only while framing the delta record for the WAL update operation.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mtaglen; \
+ int _tagtotal_len = (_len); \
+ while (_tagtotal_len > 0) \
+ { \
+ _mtaglen = _tagtotal_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _tagtotal_len; \
+ if (_mtaglen < 3) \
+ { \
+ char *_data = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mtaglen,_data); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mtaglen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mtaglen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mtaglen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _tagtotal_len -= _mtaglen; \
+ (_off) += _mtaglen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 224,229 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
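For readers following the delta tuple format documented in the transam README hunk of the attached patch (control byte, 2-3 byte history tags, length-prefixed new data), the decode side can be illustrated with a minimal standalone sketch. This is not part of the patch; the function and variable names are invented, the PGLZ_Header is assumed to have already been stripped, and bounds checking is simplified.

/*
 * Minimal sketch (not part of the patch) of decoding the delta format
 * described in the README above.  "history" is the old tuple's data area.
 */
#include <stdint.h>
#include <string.h>

static size_t
delta_decode_sketch(const uint8_t *src, size_t srclen,
                    const uint8_t *history, uint8_t *dst)
{
    const uint8_t *sp = src;
    const uint8_t *srcend = src + srclen;
    uint8_t       *dp = dst;

    while (sp < srcend)
    {
        uint8_t ctrl = *sp++;   /* control byte covers the next 8 items */
        int     i;

        for (i = 0; i < 8 && sp < srcend; i++, ctrl >>= 1)
        {
            if (ctrl & 1)
            {
                /* history tag: T1 = OOOO LLLL, T2 = OOOO OOOO */
                int off = ((sp[0] & 0xf0) << 4) | sp[1];
                int len = (sp[0] & 0x0f) + 3;

                if (len == 18)  /* low nibble was 0x0f: long match */
                {
                    len = sp[2] + 18;
                    sp += 3;
                }
                else
                    sp += 2;

                memcpy(dp, history + off, len); /* copy from old tuple */
                dp += len;
            }
            else
            {
                /* new data: one length byte, then that many literal bytes */
                int len = *sp++;

                memcpy(dp, sp, len);
                dp += len;
                sp += len;
            }
        }
    }
    return dp - dst;            /* bytes of the new tuple reconstructed */
}

In the patch itself, heap_delta_decode() performs these same steps through pglz_decompress_with_history(), with the old tuple's data area passed as the history buffer.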
On 9 January 2013 08:05, Amit kapila <amit.kapila@huawei.com> wrote:
Update patch contains handling of below Comments
Thanks
Test results with modified pgbench (1800 record size) on the latest patch:
-Patch-             -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head                 831        4.17 GB    1416       7.13 GB
WAL modification     846        2.36 GB    1712       3.31 GB

-Patch-             -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head                 2196       11.01 GB   2758       13.88 GB
WAL modification     3295       5.87 GB    5472       9.02 GB
And test results on normal pgbench?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, January 09, 2013 4:57 PM Simon Riggs wrote:
On 9 January 2013 08:05, Amit kapila <amit.kapila@huawei.com> wrote:
Update patch contains handling of below Comments
Thanks
Test results with modified pgbench (1800 record size) on the latest patch:
-Patch-             -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head                 831        4.17 GB    1416       7.13 GB
WAL modification     846        2.36 GB    1712       3.31 GB
-Patch-             -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head                 2196       11.01 GB   2758       13.88 GB
WAL modification     3295       5.87 GB    5472       9.02 GB
And test results on normal pgbench?
As there was no gain for the original pgbench in the performance readings
shown earlier, I thought it was not mandatory.
However, I shall run the normal pgbench as well; it should not lead to any
further dip in normal pgbench.
Thanks for pointing this out.
With Regards,
Amit Kapila.
On Wednesday, January 09, 2013 4:57 PM Simon Riggs wrote:
On 9 January 2013 08:05, Amit kapila <amit.kapila@huawei.com> wrote:
Update patch contains handling of below Comments
Thanks
Test results with modified pgbench (1800 record size) on the latest patch:
-Patch-             -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head                 831        4.17 GB    1416       7.13 GB
WAL modification     846        2.36 GB    1712       3.31 GB
-Patch-             -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head                 2196       11.01 GB   2758       13.88 GB
WAL modification     3295       5.87 GB    5472       9.02 GB
And test results on normal pgbench?
configuration:
shared_buffers = 4GB
wal_buffers = 16MB
checkpoint_segments = 256
checkpoint_interval = 15min
autovacuum = off
server_encoding = SQL_ASCII
client_encoding = UTF8
lc_collate = C
lc_ctype = C
init:
pgbench -s 75 -i -F 80
run:
pgbench -T 600
Test results with original pgbench (synccommit off) on the latest patch:
-Patch- -tps@-c1- -WAL@-c1- -tps@-c2- -WAL@-c2-
Head 1459 1.40 GB 2491 1.70 GB
WAL modification 1558 1.38 GB 2441 1.59 GB
-Patch- -tps@-c4- -WAL@-c4- -tps@-c8- -WAL@-c8-
Head 5139 2.49 GB 10651 4.72 GB
WAL modification 5224 2.28 GB 11329 3.96 GB
Test results with original pgbench (synccommit on) on the latest patch:
-Patch- -tps@-c1- -WAL@-c1- -tps@-c2- -WAL@-c2-
Head 146 0.45 GB 167 0.49 GB
WAL modification 144 0.44 GB 166 0.49 GB
-Patch- -tps@-c4- -WAL@-c4- -tps@-c8- -WAL@-c8-
Head 325 0.77 GB 603 1.03 GB
WAL modification 321 0.76 GB 604 1.01 GB
The results are similar to those noted by Kyotaro-San. The WAL size is reduced
even for the original pgbench.
There is a slight performance dip in some of the cases for the original pgbench.
With Regards,
Amit Kapila.
On 11 January 2013 10:40, Amit Kapila <amit.kapila@huawei.com> wrote:
Test results with original pgbench (synccommit off) on the latest patch:
-Patch-             -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head                 1459       1.40 GB    2491       1.70 GB
WAL modification     1558       1.38 GB    2441       1.59 GB
-Patch-             -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head                 5139       2.49 GB    10651      4.72 GB
WAL modification     5224       2.28 GB    11329      3.96 GB
There is slight performance dip in some of the cases for original pgbench.
Is this just one run? Can we see 3 runs please?
Can we investigate the performance dip at c=2?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 11, 2013 4:28 PM Simon Riggs wrote:
On 11 January 2013 10:40, Amit Kapila <amit.kapila@huawei.com> wrote:
Test results with original pgbench (synccommit off) on the latest patch:
-Patch-             -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head                 1459       1.40 GB    2491       1.70 GB
WAL modification     1558       1.38 GB    2441       1.59 GB
-Patch-             -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head                 5139       2.49 GB    10651      4.72 GB
WAL modification     5224       2.28 GB    11329      3.96 GB
There is a slight performance dip in some of the cases for the original pgbench.
Is this just one run? Can we see 3 runs please?
This is the average of 3 runs.
-Patch-                 -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head-1                   1648       1.47 GB    2491       1.69 GB
Head-2                   1538       1.43 GB    2529       1.72 GB
Head-3                   1192       1.31 GB    2453       1.70 GB
AvgHead                  1459       1.40 GB    2491       1.70 GB
WAL modification-1       1618       1.40 GB    2351       1.56 GB
WAL modification-2       1623       1.40 GB    2411       1.59 GB
WAL modification-3       1435       1.34 GB    2562       1.61 GB
WAL modification-Avg     1558       1.38 GB    2441       1.59 GB

-Patch-                 -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head-1                   5285       2.53 GB    11858      5.43 GB
Head-2                   5105       2.47 GB    10724      4.98 GB
Head-3                   5029       2.46 GB    9372       3.75 GB
AvgHead                  5139       2.49 GB    10651      4.72 GB
WAL modification-1       5117       2.26 GB    12092      4.42 GB
WAL modification-2       5142       2.26 GB    9965       3.48 GB
WAL modification-3       5413       2.33 GB    11930      3.99 GB
WAL modification-Avg     5224       2.28 GB    11329      3.96 GB
Can we investigate the performance dip at c=2?
Please consider the following points regarding this dip:
1. For synchronous commit = off, there is always slight variation in the data.
2. The size of WAL is reduced.
3. For small rows (128 bytes), sometimes the performance difference
created by this algorithm doesn't help much, as the size is not reduced
significantly and there is equivalent overhead for delta compression.
We could add a check so that this optimization is applied only if the row
length is greater than some threshold (128 bytes, 200 bytes), but since the
performance dip is not much and there is also a WAL reduction gain, it
should be okay without any check as well.
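A minimal sketch of the kind of row-length guard being discussed; the function name, constant name, and the 200-byte value are illustrative only, not part of any attached patch:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative threshold; 128/200 bytes are the values floated above. */
#define DELTA_ENCODE_MIN_TUPLE_LEN 200

/*
 * Sketch of the proposed guard: skip delta encoding for rows shorter than
 * the threshold, where the CPU cost of encoding is unlikely to be repaid
 * by the WAL-size reduction.
 */
static bool
worth_delta_encoding(uint32_t new_tuple_len, bool same_page,
                     bool needs_backup_block)
{
    /* existing constraints from the patch: same page, no backup block */
    if (!same_page || needs_backup_block)
        return false;

    return new_tuple_len >= DELTA_ENCODE_MIN_TUPLE_LEN;
}

In log_heap_update() such a check would simply sit in front of the existing heap_delta_encode() call.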
With Regards,
Amit Kapila.
On 11 January 2013 12:30, Amit Kapila <amit.kapila@huawei.com> wrote:
Is this just one run? Can we see 3 runs please?
This is the average of 3 runs.
The results are so variable it's almost impossible to draw any
conclusions at all. I think if we did harder stats on those we'd get
nothing.
Can you do something to bring that in? Or just do more tests to get a
better view?
-Patch-                 -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head-1                   1648       1.47 GB    2491       1.69 GB
Head-2                   1538       1.43 GB    2529       1.72 GB
Head-3                   1192       1.31 GB    2453       1.70 GB
AvgHead                  1459       1.40 GB    2491       1.70 GB
WAL modification-1       1618       1.40 GB    2351       1.56 GB
WAL modification-2       1623       1.40 GB    2411       1.59 GB
WAL modification-3       1435       1.34 GB    2562       1.61 GB
WAL modification-Avg     1558       1.38 GB    2441       1.59 GB
-Patch-                 -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head-1                   5285       2.53 GB    11858      5.43 GB
Head-2                   5105       2.47 GB    10724      4.98 GB
Head-3                   5029       2.46 GB    9372       3.75 GB
AvgHead                  5139       2.49 GB    10651      4.72 GB
WAL modification-1       5117       2.26 GB    12092      4.42 GB
WAL modification-2       5142       2.26 GB    9965       3.48 GB
WAL modification-3       5413       2.33 GB    11930      3.99 GB
WAL modification-Avg     5224       2.28 GB    11329      3.96 GB
Can we investigate the performance dip at c=2?
Please consider the following points regarding this dip:
1. For synchronous commit = off, there is always slight variation in the data.
2. The size of WAL is reduced.
3. For small rows (128 bytes), sometimes the performance difference
created by this algorithm doesn't help much, as the size is not reduced
significantly and there is equivalent overhead for delta compression.
We could add a check so that this optimization is applied only if the row
length is greater than some threshold (128 bytes, 200 bytes), but since the
performance dip is not much and there is also a WAL reduction gain, it
should be okay without any check as well.
With Regards,
Amit Kapila.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 December 2012 10:21, Simon Riggs <simon@2ndquadrant.com> wrote:
* There is a fixed 75% heuristic in the patch.
I'm concerned that we're doing extra work while holding the buffer
locked, which will exacerbate any block contention that exists.
We have a list of the columns that the UPDATE is touching since we use
that to check column permissions for the UPDATE. Which means we should
be able to use that list to check only the columns actually changing
in this UPDATE statement.
That will likely save us some time during the compression check.
Can you look into that please? I don't think it will be much work.
I've moved this to the next CF. I'm planning to review this one first.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 11, 2013 6:18 PM Simon Riggs wrote:
On 11 January 2013 12:30, Amit Kapila <amit.kapila@huawei.com> wrote:
Is this just one run? Can we see 3 runs please?
This is the average of 3 runs.
The results are so variable it's almost impossible to draw any
conclusions at all. I think if we did harder stats on those we'd get
nothing.
Can you do something to bring that in? Or just do more tests to get a
better view?
To be honest, I have tried this set of 3 readings two times and there is
similar fluctuation for synchronous_commit = off.
What I can do early next week is:
a. run this test 10 times to see the results.
b. run the tests with record length 256 instead of 128.
However, I think my results for synchronous_commit = on match Kyotaro-san's.
Please suggest if you have anything in mind.
This is for synchronous_commit = off; if you look at the results for
synchronous_commit = on, they are comparatively consistent. I think for
synchronous_commit = off there is always some fluctuation in the results.
The synchronous_commit = on results are as below:
-Patch-               -tps@-c1-  -WAL@-c1-  -tps@-c2-  -WAL@-c2-
Head-1                 149        0.46 GB    160        0.48 GB
Head-2                 145        0.45 GB    180        0.52 GB
Head-3                 144        0.45 GB    161        0.48 GB
WAL modification-1     142        0.44 GB    161        0.48 GB
WAL modification-2     146        1.45 GB    162        0.48 GB
WAL modification-3     144        1.44 GB    175        0.51 GB

-Patch-               -tps@-c4-  -WAL@-c4-  -tps@-c8-  -WAL@-c8-
Head-1                 325        0.77 GB    602        1.03 GB
Head-2                 328        0.77 GB    606        1.03 GB
Head-3                 323        0.77 GB    603        1.03 GB
WAL modification-1     324        0.76 GB    604        1.01 GB
WAL modification-2     322        0.76 GB    604        1.01 GB
WAL modification-3     317        0.75 GB    604        1.01 GB
Can we investigate the performance dip at c=2?
Please consider the following points for this dip:
1. For synchronous commit = off, there is always slight variation in the data.
2. The size of WAL is reduced.
3. For small rows (128 bytes), sometimes this algorithm doesn't help
performance much, as the size is not reduced significantly while there is
still the overhead of delta compression.
We could add a check so that this optimization is applied only if the row
length is greater than some threshold (128 bytes, 200 bytes), but I feel
that since the performance dip is not much and the WAL reduction gain is
there, it should be okay without any check as well.
With Regards,
Amit Kapila.
On Friday, January 11, 2013 6:45 PM Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon@2ndquadrant.com> wrote:
* There is a fixed 75% heuristic in the patch.
I'm concerned that we're doing extra work while holding the buffer
locked, which will exacerbate any block contention that exists.
We have a list of the columns that the UPDATE is touching since we use
that to check column permissions for the UPDATE. Which means we should
be able to use that list to check only the columns actually changing
in this UPDATE statement.
That will likely save us some time during the compression check.
Can you look into that please? I don't think it will be much work.
IIUC, I had done it that way in the initial version of the patch, i.e. do the
encoding only for the modified columns.
The reference to my initial patch is below:
+ modifiedCols = (rt_fetch(resultRelInfo->ri_RangeTableIndex,
+                          estate->es_range_table)->modifiedCols);
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852DE51@szxeml509-mbs
1. However, Heikki has pointed out that it has some problems similar to the
HOT implementation, and that is the reason we do a memcmp for HOT.
2. Also, we found in initial readings that this doesn't make any
performance difference as compared to the current approach.
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
With Regards,
Amit Kapila.
Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon@2ndquadrant.com> wrote:
* There is a fixed 75% heuristic in the patch.
I'm concerned that we're doing extra work while holding the buffer
locked, which will exacerbate any block contention that exists.
We have a list of the columns that the UPDATE is touching since we use
that to check column permissions for the UPDATE. Which means we should
be able to use that list to check only the columns actually changing
in this UPDATE statement.
But that doesn't include columns changed by triggers, AFAIR, so you
could only use that if there weren't any triggers.
I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash, but it is still
achieving a lower WAL volume, which is good.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11 January 2013 14:29, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
But that doesn't include columns changed by triggers, AFAIR, so you
could only use that if there weren't any triggers.
True, well spotted
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11 January 2013 14:24, Amit Kapila <amit.kapila@huawei.com> wrote:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852DE51@szxeml509-mbs
1. However, Heikki has pointed out that it has some problems similar to the
HOT implementation, and that is the reason we do a memcmp for HOT.
2. Also, we found in initial readings that this doesn't make any
performance difference as compared to the current approach.
OK, forget that idea.
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
In heap_delta_encode() do we store which columns have changed? Do we
store the whole new column value?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 11, 2013 9:27 PM Simon Riggs wrote:
On 11 January 2013 14:24, Amit Kapila <amit.kapila@huawei.com> wrote:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382852DE51@szxeml509-mbs
1. However, Heikki has pointed out that it has some problems similar to the
HOT implementation, and that is the reason we do a memcmp for HOT.
2. Also, we found in initial readings that this doesn't make any
performance difference as compared to the current approach.
OK, forget that idea.
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
In heap_delta_encode() do we store which columns have changed?
Not the attribute bumberwise, but offsetwise it is stored.
Do we store the whole new column value?
Yes, please refer to the else part of the code:
+ else
+ {
+     data_len = new_tup_off - change_off;
+     if ((bp + (2 * data_len)) - bstart >= result_max)
+         return false;
+
+     /* Copy the modified column data to the output buffer if present */
+     pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
With Regards,
Amit Kapila.
On Friday, January 11, 2013 7:59 PM Alvaro Herrera wrote:
Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon@2ndquadrant.com> wrote:
I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash,
For larger tuple sizes (>1000 && < 1800), the performance gain will be good.
Please refer to the performance results from me and Kyotaro-san in the links below:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BEAAE32@szxeml509-mbx
http://archives.postgresql.org/message-id/20121228.170748.90887322.horiguchi.kyotaro@lab.ntt.co.jp
In fact, I believe that for all tuples with lengths between 200 and 1800 bytes and with around 15~20% of the values changed, there will be both a performance gain and a WAL reduction.
The reason for keeping the logic the same for smaller tuples (<=128 bytes) as well is that there is not much performance difference, but the WAL reduction gain is still visible.
but it is still achieving a lower WAL volume, which is good.
With Regards,
Amit Kapila.
On 11 January 2013 17:08, Amit kapila <amit.kapila@huawei.com> wrote:
Just reviewing the patch now, making more sense with comments added.
In heap_delta_encode() do we store which columns have changed?
Not the attribute bumberwise, but offsetwise it is stored.
(Does that mean "numberwise"??)
Can we identify which columns have changed? i.e. 1st, 3rd and 12th columns?
Do we store the whole new column value?
Yes, please refer to the else part of the code:
+ else
+ {
+     data_len = new_tup_off - change_off;
+     if ((bp + (2 * data_len)) - bstart >= result_max)
+         return false;
+
+     /* Copy the modified column data to the output buffer if present */
+     pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
"modified column data" could mean either 1) (modified column) data
i.e. the data for the modified column, or 2) modified (column data)
i.e. the modified data in the column. I read that as (2) and didn't
look at the code. ;-)
Happy now that I know it's (1)
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11 January 2013 17:30, Amit kapila <amit.kapila@huawei.com> wrote:
On Friday, January 11, 2013 7:59 PM Alvaro Herrera wrote:
Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon@2ndquadrant.com> wrote:
I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash,
For larger tuple sizes (>1000 && < 1800), the performance gain will be good.
Please refer to the performance results from me and Kyotaro-san in the links below:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BEAAE32@szxeml509-mbx
http://archives.postgresql.org/message-id/20121228.170748.90887322.horiguchi.kyotaro@lab.ntt.co.jp
AFAICS your tests are badly variable, but as Alvaro says, they aren't
accurate enough to tell there's a regression.
I'll assume not and carry on.
(BTW the rejection of the null bitmap patch because of a performance
regression may also need to be reconsidered).
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Friday, January 11, 2013 11:09 PM Simon Riggs wrote:
On 11 January 2013 17:08, Amit kapila <amit.kapila@huawei.com> wrote:
Just reviewing the patch now, making more sense with comments added.
In heap_delta_encode() do we store which columns have changed?
Not the attribute bumberwise, but offsetwise it is stored.
(Does that mean "numberwise"??)
Yes.
Can we identify which columns have changed? i.e. 1st, 3rd and 12th columns?
As per the current algorithm, we can't, as it is based on offsets.
What I mean to say is that the basic idea for reconstructing the tuple during
recovery is to copy data from the old tuple offset-wise (the offsets are stored
in the encoded tuple) and to use the new data (modified column data) from the
encoded tuple directly. So we don't need exact column numbers.
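As a standalone illustration of that reconstruction idea, here is a toy decoder; the byte layout is deliberately simplified and hypothetical, not the pglz control-byte format the patch actually uses:

#include <stdint.h>
#include <string.h>

/*
 * Toy decoder: rebuild the new tuple from the old tuple plus a delta.  In
 * this made-up layout each control byte covers up to 8 items; a set bit
 * means "2-byte offset + 1-byte length copied from the old tuple", a clear
 * bit means "1-byte length of literal new data follows".  Returns the
 * number of bytes written to new_tuple.
 */
static size_t
delta_decode_sketch(const uint8_t *delta, size_t delta_len,
                    const uint8_t *old_tuple, uint8_t *new_tuple)
{
    const uint8_t *dp = delta;
    const uint8_t *dend = delta + delta_len;
    uint8_t    *out = new_tuple;

    while (dp < dend)
    {
        uint8_t     ctrl = *dp++;
        int         bit;

        for (bit = 0; bit < 8 && dp < dend; bit++)
        {
            if (ctrl & (1 << bit))
            {
                /* history reference: copy a range of the old tuple */
                uint16_t    off = (uint16_t) (dp[0] | (dp[1] << 8));
                uint8_t     len = dp[2];

                dp += 3;
                memcpy(out, old_tuple + off, len);
                out += len;
            }
            else
            {
                /* literal run: take the new column data from the record */
                uint8_t     len = *dp++;

                memcpy(out, dp, len);
                dp += len;
                out += len;
            }
        }
    }
    return (size_t) (out - new_tuple);
}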
With Regards,
Amit Kapila.
On Friday, January 11, 2013 11:12 PM Simon Riggs wrote:
On 11 January 2013 17:30, Amit kapila <amit.kapila@huawei.com> wrote:
On Friday, January 11, 2013 7:59 PM Alvaro Herrera wrote:
Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon@2ndquadrant.com> wrote:
I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash,
For larger tuple sizes (>1000 && < 1800), the performance gain will be good.
Please refer to the performance results from me and Kyotaro-san in the links below:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BEAAE32@szxeml509-mbx
http://archives.postgresql.org/message-id/20121228.170748.90887322.horiguchi.kyotaro@lab.ntt.co.jp
AFAICS your tests are badly variable, but as Alvaro says, they aren't
accurate enough to tell there's a regression.
I'll assume not and carry on.
(BTW the rejection of the null bitmap patch because of a performance
regression may also need to be reconsidered).
I can post detailed numbers during the next commit fest.
With Regards,
Amit Kapila.
On 11 January 2013 18:11, Amit kapila <amit.kapila@huawei.com> wrote:
Can we identify which columns have changed? i.e. 1st, 3rd and 12th columns?
As per the current algorithm, we can't, as it is based on offsets.
What I mean to say is that the basic idea for reconstructing the tuple during
recovery is to copy data from the old tuple offset-wise (the offsets are stored
in the encoded tuple) and to use the new data (modified column data) from the
encoded tuple directly. So we don't need exact column numbers.
Another patch related to reassembling changes from WAL records is going
through the next CF.
To do that efficiently, we would want to store a bitmap showing which
columns had changed in each update. Would that be an easy addition, or
is that blocked by some aspect of the current design?
The idea would be that we could re-construct an UPDATE statement that
would perform exactly the same change, yet without needing to refer to
a base tuple.
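For what it's worth, here is a rough sketch of how such a bitmap could be derived at encode time; the function name is hypothetical, null handling and detoasting are glossed over, and the headers assume the current (9.3-era) tree:

#include "postgres.h"
#include "access/htup_details.h"
#include "nodes/bitmapset.h"
#include "utils/datum.h"

/*
 * Hypothetical helper: collect the attribute numbers whose stored values
 * differ between the old and new versions of a tuple.
 */
static Bitmapset *
heap_changed_attrs_sketch(HeapTuple oldtup, HeapTuple newtup,
                          TupleDesc tupdesc)
{
    Bitmapset  *changed = NULL;
    int         attno;

    for (attno = 1; attno <= tupdesc->natts; attno++)
    {
        Form_pg_attribute att = tupdesc->attrs[attno - 1];
        bool        oldisnull;
        bool        newisnull;
        Datum       oldval = heap_getattr(oldtup, attno, tupdesc, &oldisnull);
        Datum       newval = heap_getattr(newtup, attno, tupdesc, &newisnull);

        if (oldisnull != newisnull ||
            (!oldisnull &&
             !datumIsEqual(oldval, newval, att->attbyval, att->attlen)))
            changed = bms_add_member(changed, attno);
    }

    return changed;
}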
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, January 12, 2013 12:23 AM Simon Riggs wrote:
On 11 January 2013 18:11, Amit kapila <amit.kapila@huawei.com> wrote:
Can we identify which columns have changed? i.e. 1st, 3rd and 12th columns?
As per the current algorithm, we can't, as it is based on offsets.
What I mean to say is that the basic idea for reconstructing the tuple during
recovery is to copy data from the old tuple offset-wise (the offsets are stored
in the encoded tuple) and to use the new data (modified column data) from the
encoded tuple directly. So we don't need exact column numbers.
Another patch is going through next CF related to reassembling changes
from WAL records.
To do that efficiently, we would want to store a bitmap showing which
columns had changed in each update. Would that be an easy addition, or
is that blocked by some aspect of the current design?
I don't think it should be a problem, as such an update can go through the
current (non-delta) way of WAL tuple construction, as we do in this patch when
the old and new buffers are different. This differentiation is done in
log_heap_update.
IMO, for now we can avoid this optimization for the bitmap-storing patch (the
way we have done when the updated tuple is not on the same page), and later we
can evaluate whether we can do this optimization for the feature of that
patch.
The idea would be that we could re-construct an UPDATE statement that
would perform exactly the same change, yet without needing to refer to
a base tuple.
I understand that such functionality would be needed by logical replication.
With Regards,
Amit Kapila.
On 12 January 2013 03:50, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, January 12, 2013 12:23 AM Simon Riggs wrote:
On 11 January 2013 18:11, Amit kapila <amit.kapila@huawei.com> wrote:
Can we identify which columns have changed? i.e. 1st, 3rd and 12th columns?
As per the current algorithm, we can't, as it is based on offsets.
What I mean to say is that the basic idea for reconstructing the tuple during
recovery is to copy data from the old tuple offset-wise (the offsets are stored
in the encoded tuple) and to use the new data (modified column data) from the
encoded tuple directly. So we don't need exact column numbers.
Another patch related to reassembling changes from WAL records is going
through the next CF.
To do that efficiently, we would want to store a bitmap showing which
columns had changed in each update. Would that be an easy addition, or
is that blocked by some aspect of the current design?
I don't think it should be a problem, as such an update can go through the
current (non-delta) way of WAL tuple construction, as we do in this patch when
the old and new buffers are different. This differentiation is done in
log_heap_update.
IMO, for now we can avoid this optimization for the bitmap-storing patch (the
way we have done when the updated tuple is not on the same page), and later we
can evaluate whether we can do this optimization for the feature of that
patch.
Yes, we can simply disable this feature. But that is just bad planning
and we should give some thought to having new features play nicely
together.
I would like to work out how to modify this so it can work with wal
decoding enabled. I know we can do this, I want to look at how,
because we know we're going to do it.
The idea would be that we could re-construct an UPDATE statement that
would perform exactly the same change, yet without needing to refer to
a base tuple.
I understand that such functionality would be needed by logical replication.
Yes, though the features being added are to allow decoding of WAL for
any purpose.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 11 January 2013 15:57, Simon Riggs <simon@2ndquadrant.com> wrote:
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
Making more sense, but not yet making complete sense.
I'd like you to revisit the patch comments since some of them are
completely unreadable.
Examples
"Frames the original tuple which needs to be inserted into the heap by
decoding the WAL tuplewith the help of old Heap tuple."
"The delta tuples for update WAL is to eliminate copying the entire
the new record to WAL for the update operation."
I don't mind rewording the odd line here and there, that's just normal
editing, but this needs extensive work in terms of number of places
requiring change and the level of change at each place. That's not
easy for me to do when I'm trying to understand the patch in the first
place. My own written English isn't that great, so please read some of
the other comments in other parts of the code so you can see the level
of clarity that's needed in PostgreSQL.
Copying chunks of text from other comments doesn't help much either,
especially when you miss out parts of the explanation. You refer to a
"history tag" but don't define it that well, and don't explain why it
might sometimes be 3 bytes, or what that means. pg_lzcompress doesn't
call it that either, which is confusing. If you use a concept from
elsewhere you should either use the same name, or if you rename it,
rename it in both places.
/*
* Do only the delta encode when the update is going to the same page and
* buffer doesn't need a backup block in case of full-pagewrite is on.
*/
if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
The comment above says nothing. I can see that oldbuf and newbuf must
be the same and the call to XLogCheckBufferNeedsBackup is clear
because the function is already well named.
What I'd expect to see here is a discussion of why this test is being
applied and maybe why it is applied here. Such an important test
deserves a long discussion, perhaps 10-20 lines of comment.
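For example, a comment along the following lines might be the kind of thing being asked for; this wording is only a suggestion pieced together from the discussion in this thread, not text from the patch:

/*
 * Delta-encode the new tuple only when the old and new versions are on
 * the same page and no full-page image of that page will be attached to
 * this record.
 *
 * Same page: at redo, the new tuple is rebuilt from the encoded delta
 * plus the old tuple, so the old version must be available on the page
 * the redo routine is already working on.
 *
 * Full-page image: if a backup block is emitted with this record, replay
 * restores the whole page from that image rather than applying the tuple
 * data, so shrinking the tuple payload gains nothing while still paying
 * the encoding cost under the buffer lock.
 */
if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))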
Thanks
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Saturday, January 12, 2013 3:45 PM Simon Riggs wrote:
On 12 January 2013 03:50, Amit kapila <amit.kapila@huawei.com> wrote:
On Saturday, January 12, 2013 12:23 AM Simon Riggs wrote:
On 11 January 2013 18:11, Amit kapila <amit.kapila@huawei.com> wrote:
Can we identify which columns have changed? i.e. 1st, 3rd and 12th columns?
As per the current algorithm, we can't, as it is based on offsets.
What I mean to say is that the basic idea for reconstructing the tuple during
recovery is to copy data from the old tuple offset-wise (the offsets are stored
in the encoded tuple) and to use the new data (modified column data) from the
encoded tuple directly. So we don't need exact column numbers.
Another patch related to reassembling changes from WAL records is going
through the next CF.
To do that efficiently, we would want to store a bitmap showing which
columns had changed in each update. Would that be an easy addition, or
is that blocked by some aspect of the current design?
I don't think it should be a problem, as such an update can go through the
current (non-delta) way of WAL tuple construction, as we do in this patch when
the old and new buffers are different. This differentiation is done in
log_heap_update.
IMO, for now we can avoid this optimization for the bitmap-storing patch (the
way we have done when the updated tuple is not on the same page), and later we
can evaluate whether we can do this optimization for the feature of that
patch.
Yes, we can simply disable this feature. But that is just bad planning
and we should give some thought to having new features play nicely
together.
I would like to work out how to modify this so it can work with wal
decoding enabled. I know we can do this, I want to look at how,
because we know we're going to do it.
I am sure this can be done, as for WAL decoding we mainly need the new values
and the column numbers. So if we include a bitmap in the WAL tuple and teach
the WAL decoding method how to decode this new-format WAL tuple, it can be
done.
However, it will need changes in the algorithm for both patches, and that is a
risk for one or both of them.
I am open to discussing how both can work together, but IMHO at this moment
(as this will be the last CF) it would be a little risky.
If there is some way that, with minor modifications, we can address this
scenario, I will be happy to see both working together.
With Regards,
Amit Kapila.
On Saturday, January 12, 2013 4:36 PM Simon Riggs wrote:
On 11 January 2013 15:57, Simon Riggs <simon@2ndquadrant.com> wrote:
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
Making more sense, but not yet making complete sense.
I'd like you to revisit the patch comments since some of them are
completely unreadable.
I will once again review all the comments and make them more meaningful.
With Regards,
Amit Kapila.