Re: Performance Improvement by reducing WAL for Update Operation
On Friday, January 11, 2013 11:12 PM Simon Riggs wrote:
On 11 January 2013 17:30, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
On Friday, January 11, 2013 7:59 PM Alvaro Herrera wrote:
Simon Riggs wrote:
On 28 December 2012 10:21, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash,

For larger tuple sizes (> 1000 && < 1800), the performance gain will be good.
Please refer to the performance results from me and Kyotaro-san at the links below:
http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BEAAE32(at)szxeml509-mbx
http://archives.postgresql.org/message-id/20121228(dot)170748(dot)90887322(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp
AFAICS your tests are badly variable, but as Alvaro says, they aren't
accurate enough to tell there's a regression.
Running the performance scenario on the SUSE 11 machine, the readings do not vary much except at 8 threads, which I attribute to the machine having only 4 cores.
Performance readings are attached for the original pgbench schema and for record sizes of 256, 512, 1000 and 1800.
Conclusions from the readings:
1. With the original pgbench there is a maximum 9% WAL reduction with not much performance difference.
2. With a record size of 250 there is a maximum WAL reduction of 30% with not much performance difference.
3. With record sizes of 500 and above there is an improvement in both performance and WAL reduction.
As the record size increases, the performance gain grows and the WAL size is reduced further.
With Regards,
Amit Kapila.
Attachments:
On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
Making more sense, but not yet making complete sense.
I'd like you to revisit the patch comments since some of them are completely unreadable.
I have modified most of the comments in the code.
The changes in the attached patch are as below:
1. Introduced the term Encoded WAL Tuple (EWT) to refer to the delta-encoded tuple for the update operation (a rough sketch of the layout is given after this list).
It can be renamed to one of the below:
a. WAL Encoded Tuple (WET)
b. Delta Encoded WAL Tuple (DEWT)
c. Delta WAL Encoded Tuple (DWET)
d. any others?
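For reference, here is a rough sketch of what an EWT contains, following the format described in the transam/README changes of the attached patch. The layout below is only illustrative, not byte-exact; the holder struct is the one used by log_heap_update in the patch.

/*
 * Illustrative EWT stream for an update that modifies one column:
 *
 *   PGLZ_Header            raw (uncompressed) length plus encoded length
 *   control byte           one bit per following item (set = History Reference,
 *                          unset = New Data)
 *   [len][bitmap bytes]    New Data: null bitmap of the new tuple
 *   [T1][T2(,T3)]          History Reference: unchanged leading columns, copied
 *                          from the old tuple version by [offset, length]
 *   [len][column data]     New Data: the modified column value
 *   [T1][T2(,T3)]          History Reference: unchanged trailing columns
 */
struct
{
    PGLZ_Header pglzheader;            /* EWT header */
    char        buf[MaxHeapTupleSize]; /* encoded stream follows the header */
} ewt_buf;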
2. I have kept the wording related to compression in the modified docs, but I have tried to copy parts completely.
IMO this is required as there are some changes w.r.t. LZ compression, such as for New Data (a short sketch of the difference follows).
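To illustrate the difference, here is a minimal sketch only; it assumes the pglz_out_literal and pglz_out_add macros from utils/pg_lzcompress.h as arranged by this patch, mirrors the control-byte setup used in heap_delta_encode, and the 5-byte payload is made up. The original LZ format emits unmatched data as literals, one control bit plus one output byte per input byte, whereas the New Data item in an EWT emits one control bit, one length byte and then the raw data. The two encodings are shown one after the other purely to compare output sizes.

unsigned char ctrl_dummy = 0;
unsigned char *ctrlp = &ctrl_dummy;   /* position of the current control byte */
unsigned char ctrlb = 0;              /* value of the current control byte */
unsigned char ctrl = 0;               /* next control bit to use */
unsigned char out[64];
unsigned char *bp = out;              /* output position */
char          data[] = "ABCDE";       /* 5 bytes of modified column data */
char         *dp = data;
int           i;

/* LZ literals: 5 control bits + 5 data bytes in the output */
for (i = 0; i < 5; i++)
    pglz_out_literal(ctrlp, ctrlb, ctrl, bp, dp[i]);

/* EWT New Data: 1 control bit + 1 length byte + 5 data bytes in the output */
pglz_out_add(ctrlp, ctrlb, ctrl, bp, 5, dp);

*ctrlp = ctrlb;                       /* flush the last control byte, as heap_delta_encode does */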
3. There is a small coding change, as it had been overwritten by one of my previous patches.
The calculation of the approximate length for the encoded WAL tuple:
Previous Patch:
if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)
New Patch:
if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
The previous calculation would only have been valid if we had used the LZ format exactly.
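To spell out the new estimate with a made-up number (the variable names are the ones used in heap_delta_encode): emitting the new tuple's null bitmap as a single New Data item costs at most 1 control byte + 1 length byte + new_tup_bitmaplen data bytes, so 2 + new_tup_bitmaplen is the worst case to reserve. The old bound of 2 * new_tup_bitmaplen corresponded to the worst case of the plain LZ output format, which no longer applies to the New Data encoding.

/* e.g. new_tup_bitmaplen = 8 (hypothetical value)                       */
/* old reservation: 2 * 8 = 16 bytes                                     */
/* new reservation: 2 + 8 = 10 bytes (control byte + length byte + data) */
if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
    return false;   /* encoding would not meet wal_update_compression_ratio */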
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v8.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ /* GUC variable for EWT compression ratio */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 300,312 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 324,333 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 386,394 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 515,536 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 638,1061 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_attr_get_length_and_check_equals
+ *
+ * returns the result of comparison of specified attribute's value for
+ * input tuples.
+ * outputs the length of specified attribute's value for
+ * input tuples.
+ * ----------------
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values and length of values. XXX this is
+ * pretty inefficient if there are many indexed columns. Should
+ * HeapSatisfiesHOTUpdate do a single heap_deform_tuple call on each
+ * tuple, instead? But that doesn't work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Construct a delta Encoded WAL Tuple (EWT) by comparing old and new
+ * tuple versions w.r.t column boundaries.
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ * Encode Mechanism:
+ *
+ * Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and
+ * loop for all attributes to find any modifications in the attributes.
+ * The unmodified data is encoded as a History Reference in EWT and
+ * the modified data (if NOT NULL) is encoded as New Data in EWT.
+ *
+ * The offset values are calculated with respect to the tuple t_hoff
+ * value. For each column attribute old and new tuple offsets
+ * are recalculated based on padding in the tuples.
+ * Once the alignment difference is found between old and new tuple
+ * versions, then include alignment difference as New Data in EWT.
+ *
+ * The maximum encoded data length is 75% (default compression rate)
+ * of the original data. If the encoded output data length is greater
+ * than that, the original tuple (new tuple version) will be stored
+ * directly in the WAL.
+ *
+ *
+ * History Reference:
+ * If any column is modified then the unmodified columns data till the
+ * modified column needs to be copied to EWT as a Tag.
+ *
+ *
+ * New data (modified data):
+ * The first byte represents the length [0-255] of the modified data,
+ * followed by the modified data of corresponding length.
+ *
+ * For more details about Encoded WAL Tuple (EWT) representation,
+ * refer to transam/README
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ old_tup_len,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ old_tup_len = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (old_tup_len >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * If the lengths of the old and new tuple versions vary by more than 50%,
+ * include the new tuple as-is
+ */
+ if ((new_tup_len <= (old_tup_len >> 1))
+ || (old_tup_len <= (new_tup_len >> 1)))
+ return false;
+
+ /* Required compression ratio for EWT */
+ result_max = (new_tup_len * (100 - wal_update_compression_ratio)) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Advance the EWT by adding the approximate length of the operation for
+ * new data as [1 control byte + 1 length byte + data_length] and validate
+ * it with result_max. The same length approximation is used in the
+ * function for New data.
+ */
+ if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ /*
+ * Loop through all attributes. If an attribute is modified by the update
+ * operation, store the [Offset,Length] referring to the old tuple version
+ * up to the last unchanged column in the EWT as a History Reference;
+ * otherwise store the [Length,Data] from the new tuple version as New Data.
+ */
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, then
+ * encode it as a copy from the history tuple with the corresponding
+ * length and offset.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding in
+ * the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ /* Add the modified column data to the EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the alignment for the old and new tuple versions for this
+ * attribute. If the alignment is the same, we continue with the next
+ * attribute; otherwise 1. store the [Offset,Length] referring to the old
+ * tuple version for the previous attribute (if the previous attribute is
+ * the same in the old and new tuple versions) in the EWT as a History
+ * Reference, and 2. add the [Length,Data] for the alignment from the new
+ * tuple as New Data in the EWT.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the previous attribute value of the old and
+ * new tuple versions is the same, then store everything up to
+ * the current match as a History Reference tag in the EWT.
+ */
+ if (is_attr_equals)
+ {
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 + new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data is present then add it to the EWT. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * If any left out old tuple data is present then copy it as history
+ * reference
+ */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /* Fill in the actual length of the compressed datum */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ *
+ * Decode Mechanism:
+ * Skip header and Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ * Check each control bit, if the bit is set then it is History Reference which
+ * means the next 2 - 3 byte tag provides the offset and length of history match.
+ * Use the offset and corresponding length to copy data from old tuple version
+ * to new tuple.
+ * If the control bit is unset, then it is New Data Reference which means
+ * first byte contains the length [0-255] of the modified data, followed
+ * by the modified data of corresponding length specified in the first byte.
+ *
+ * Tag in History Reference:
+ * 2-3 byte tag -
+ * 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ * 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ * or equal to 18.
+ * The maximum length that can be represented by one Tag is 273.
+ *
+ * For more details about the Encoded WAL Tuple (EWT) representation, refer to transam/README
+ *
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 85,90 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
--- 85,91 ----
TransactionId xid, CommandId cid, int options);
static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
ItemPointerData from, Buffer newbuf, HeapTuple newtup,
+ HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared);
static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
HeapTuple oldtup, HeapTuple newtup);
***************
*** 857,862 **** heapgettup_pagemode(HeapScanDesc scan,
--- 858,911 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len)))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 873,879 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 922,929 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc)))
)
:
(
***************
*** 3229,3238 **** l2:
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
--- 3279,3290 ----
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
! XLogRecPtr recptr;
!
! recptr = log_heap_update(relation, buffer, oldtup.t_self,
! newbuf, heaptup, &oldtup,
! all_visible_cleared,
! all_visible_cleared_new);
if (newbuf != buffer)
{
***************
*** 3299,3372 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
!
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3351,3361 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 4464,4470 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
--- 4453,4459 ----
*/
static XLogRecPtr
log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
! Buffer newbuf, HeapTuple newtup, HeapTuple oldtup,
bool all_visible_cleared, bool new_all_visible_cleared)
{
xl_heap_update xlrec;
***************
*** 4473,4478 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 4462,4477 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 4482,4492 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 4481,4522 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = from;
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 4513,4521 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 4543,4554 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 5291,5297 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5324,5333 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 5306,5312 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5342,5348 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5366,5372 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
--- 5402,5408 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID |
***************
*** 5385,5391 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 5421,5427 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 5410,5416 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5446,5452 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5473,5482 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 5509,5540 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 5491,5497 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 5549,5555 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,778 ----
entry until we've successfully done the original action.
+ Encoded WAL Tuple (EWT)
+ -----------------------
+
+ A Delta Encoded WAL Tuple (EWT) eliminates the need to copy the entire tuple to WAL for an update operation.
+ The EWT is constructed by comparing the old and new versions of the tuple w.r.t. column boundaries. It contains the data
+ from the new tuple for modified columns and a reference [Offset,Length] into the old tuple version for unchanged columns.
+
+
+ EWT Format
+ ----------
+
+ Header + Control byte + History Reference (2 - 3)bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is the same as PGLZ_Header, which is used to store the compressed length and raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this the control byte.
+
+
+ History Reference:
+
+ A set bit in the control byte means, that a tag of 2-3 bytes follows. A tag contains information
+ to copy some bytes from old tuple version to the current location in the output.
+
+ Details about 2-3 byte Tag
+ 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ or equal to 18.
+ The maximum length that can be represented by one Tag is 273.
+
+ Let's call the three tag bytes T1, T2 and T3. The position of the data to copy is coded as an offset
+ from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ This limits the offset to 1-4095 (12 bits) and the length to 3-18 (4 bits) because 3 is always added to it.
+
+ In the actual implementation, the 2 byte tag's length is limited to 3-17, because the value 0xF
+ in the length nibble has special meaning. It means, that the next following byte (T3) has to be
+ added to the length value of 18. That makes total limits of 1-4095 for offset and 3-273 for length.
+
+
+
+
+ New data:
+
+ An unset bit in the control byte represents modified data of new tuple version.
+ The first byte represents the length [0-255] of the modified data, followed by the
+ modified data of corresponding length.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+ This encoding is very similar to LZ Compression used in PostgreSQL (pg_lzcompress.c).
+
+
+ Encoding Mechanism for EWT
+ --------------------------
+ Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and loop for all attributes
+ to find any modifications in the attributes. The unmodified data is encoded as a
+ History Reference in EWT and the modified data (if NOT NULL) is encoded as New Data in EWT.
+
+ The offset values are calculated with respect to the tuple t_hoff value. For each column attribute
+ old and new tuple offsets are recalculated based on padding in the tuples.
+ Once the alignment difference is found between old and new tuple versions,
+ then include alignment difference as New Data in EWT.
+
+ The maximum encoded data length is 75% (default compression rate) of the original data; if the encoded output data
+ length is greater than that, the original tuple (new tuple version) will be directly stored in the WAL.
+
+
+ Decoding Mechanism for EWT
+ --------------------------
+ Skip header and Read one control byte and process the next 8 items (or as many as remain in the compressed input).
+ Check each control bit, if the bit is set then it is History Reference which means the next 2 - 3 byte tag
+ provides the offset and length of history match.
+ Use the offset and corresponding length to copy data from old tuple version to new tuple.
+ If the control bit is unset, then it is New Data Reference which means first byte contains the
+ length [0-255] of the modified data, followed by the modified data of corresponding length
+ specified in the first byte.
+
+
+ Constraints for EWT
+ --------------------
+ 1. Delta encoding is allowed only when the update goes to the same page and
+ the buffer does not need a backup block when full_page_writes is on.
+ 2. Only old tuples with length less than PGLZ_HISTORY_SIZE are allowed for encoding.
+ 3. Old and new tuple versions must not vary in length by more than 50% to be allowed for encoding.
+
+
Asynchronous Commit
-------------------
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1204,1209 **** begin:;
--- 1204,1231 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,726 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * The byte at current offset in the source is the length
! * of this literal segment. See pglz_out_add for encoding
! * side.
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 142,161 ----
{
xl_heaptid target; /* deleted tuple id */
ItemPointerData newtid; /* new inserted tuple id */
! int flags; /* flag bits, see below */
!
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
!
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old page's
! all visible bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new page's
! all visible bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the update
! operation is delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(char))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 528,533 **** struct MinimalTupleData
--- 529,535 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 542,550 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 544,549 ----
***************
*** 572,585 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 571,626 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 596,616 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 637,679 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 620,625 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 683,690 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 701,714 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,31 ----
int32 rawsize;
} PGLZ_Header;
+ /* LZ algorithm can hold only history offset in the range of 1 - 4095. */
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 89,207 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * calculate the approximate length required for history reference tag for the
+ * given length
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-3 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * The backward/history reference is split into multiple chunks if the
+ * given length is more than the maximum match, and the process repeats
+ * until the whole length is processed.
+ *
+ * If the matched history length is less than 3 bytes then it is added
+ * as New Data instead of a history reference. This occurs only while
+ * framing an EWT.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mtaglen; \
+ int _tagtotal_len = (_len); \
+ while (_tagtotal_len > 0) \
+ { \
+ _mtaglen = _tagtotal_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _tagtotal_len; \
+ if (_mtaglen < 3) \
+ { \
+ char *_data = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mtaglen,_data); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mtaglen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mtaglen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mtaglen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _tagtotal_len -= _mtaglen; \
+ (_off) += _mtaglen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 224,229 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On Monday, January 21, 2013 9:32 PM Amit kapila wrote:
On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
I've moved this to the next CF. I'm planning to review this one first.
Thank you.
Just reviewing the patch now, making more sense with comments added.
Making more sense, but not yet making complete sense.
I'd like you to revisit the patch comments since some of them are
completely unreadable.
I have modified most of the comments in code.
The changes in attached patch are as below:
Rebased the patch as per HEAD.
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v9.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ /* GUC variable for EWT compression ratio */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 297,308 **** heap_attisnull(HeapTuple tup, int attnum)
}
/* ----------------
! * nocachegetattr
*
! * This only gets called from fastgetattr() macro, in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
--- 300,312 ----
}
/* ----------------
! * nocachegetattr_with_len
*
! * This only gets called in cases where
* we can't use a cacheoffset and the value is not null.
*
! * This caches attribute offsets in the attribute descriptor and
! * outputs the length of the attribute value.
*
* An alternative way to speed things up would be to cache offsets
* with the tuple, but that seems more difficult unless you take
***************
*** 320,328 **** heap_attisnull(HeapTuple tup, int attnum)
* ----------------
*/
Datum
! nocachegetattr(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
--- 324,333 ----
* ----------------
*/
Datum
! nocachegetattr_with_len(HeapTuple tuple,
! int attnum,
! TupleDesc tupleDesc,
! Size *len)
{
HeapTupleHeader tup = tuple->t_data;
Form_pg_attribute *att = tupleDesc->attrs;
***************
*** 381,386 **** nocachegetattr(HeapTuple tuple,
--- 386,394 ----
*/
if (att[attnum]->attcacheoff >= 0)
{
+ if (len)
+ *len = att_getlength(att[attnum]->attlen,
+ tp + att[attnum]->attcacheoff);
return fetchatt(att[attnum],
tp + att[attnum]->attcacheoff);
}
***************
*** 507,515 **** nocachegetattr(HeapTuple tuple,
--- 515,536 ----
}
}
+ if (len)
+ *len = att_getlength(att[attnum]->attlen, tp + off);
return fetchatt(att[attnum], tp + off);
}
+ /*
+ * nocachegetattr
+ */
+ Datum
+ nocachegetattr(HeapTuple tuple,
+ int attnum,
+ TupleDesc tupleDesc)
+ {
+ return nocachegetattr_with_len(tuple, attnum, tupleDesc, NULL);
+ }
+
/* ----------------
* heap_getsysattr
*
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 638,1061 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_attr_get_length_and_check_equals
+ *
+ * Returns whether the specified attribute's values are equal in the two
+ * input tuples, and outputs the length of the attribute's value in each
+ * input tuple.
+ * ----------------
+ */
+ bool
+ heap_attr_get_length_and_check_equals(TupleDesc tupdesc, int attrnum,
+ HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len)
+ {
+ Datum value1,
+ value2;
+ bool isnull1,
+ isnull2;
+ Form_pg_attribute att;
+
+ *tup1_attr_len = 0;
+ *tup2_attr_len = 0;
+
+ /*
+ * If it's a whole-tuple reference, say "not equal". It's not really
+ * worth supporting this case, since it could only succeed after a no-op
+ * update, which is hardly a case worth optimizing for.
+ */
+ if (attrnum == 0)
+ return false;
+
+ /*
+ * Likewise, automatically say "not equal" for any system attribute other
+ * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ * chain, or even to be set correctly yet in the new tuple.
+ */
+ if (attrnum < 0)
+ {
+ if (attrnum != ObjectIdAttributeNumber &&
+ attrnum != TableOidAttributeNumber)
+ return false;
+ }
+
+ /*
+ * Extract the corresponding values and length of values. XXX this is
+ * pretty inefficient if there are many indexed columns. Should
+ * HeapSatisfiesHOTUpdate do a single heap_deform_tuple call on each
+ * tuple, instead? But that doesn't work for system columns ...
+ */
+ value1 = heap_getattr_with_len(tup1, attrnum, tupdesc, &isnull1, tup1_attr_len);
+ value2 = heap_getattr_with_len(tup2, attrnum, tupdesc, &isnull2, tup2_attr_len);
+
+ /*
+ * If one value is NULL and other is not, then they are certainly not
+ * equal
+ */
+ if (isnull1 != isnull2)
+ return false;
+
+ /*
+ * If both are NULL, they can be considered equal.
+ */
+ if (isnull1)
+ return true;
+
+ /*
+ * We do simple binary comparison of the two datums. This may be overly
+ * strict because there can be multiple binary representations for the
+ * same logical value. But we should be OK as long as there are no false
+ * positives. Using a type-specific equality operator is messy because
+ * there could be multiple notions of equality in different operator
+ * classes; furthermore, we cannot safely invoke user-defined functions
+ * while holding exclusive buffer lock.
+ */
+ if (attrnum <= 0)
+ {
+ /* The only allowed system columns are OIDs, so do this */
+ return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ }
+ else
+ {
+ Assert(attrnum <= tupdesc->natts);
+ att = tupdesc->attrs[attrnum - 1];
+ return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ }
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Construct a delta Encoded WAL Tuple (EWT) by comparing old and new
+ * tuple versions w.r.t column boundaries.
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ * Encode Mechanism:
+ *
+ * Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and
+ * loop for all attributes to find any modifications in the attributes.
+ * The unmodified data is encoded as a History Reference in EWT and
+ * the modified data (if NOT NULL) is encoded as New Data in EWT.
+ *
+ * The offset values are calculated with respect to the tuple t_hoff
+ * value. For each column attribute old and new tuple offsets
+ * are recalculated based on padding in the tuples.
+ * Once an alignment difference is found between the old and new tuple
+ * versions, the alignment difference is included as New Data in the EWT.
+ *
+ * The maximum encoded data length is 75% (default compression rate)
+ * of the original data. If the encoded output is longer than that,
+ * the original tuple (new tuple version) is stored directly in the
+ * WAL record.
+ *
+ *
+ * History Reference:
+ * If a column is modified, the data of the unmodified columns preceding it
+ * is copied to the EWT as a Tag.
+ *
+ *
+ * New data (modified data):
+ * The first byte represents the length [0-255] of the modified data,
+ * followed by the modified data of corresponding length.
+ *
+ * For more details about Encoded WAL Tuple (EWT) representation,
+ * refer to transam/README
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ PGLZ_Header *encdata)
+ {
+ Form_pg_attribute *att = tupleDesc->attrs;
+ int numberOfAttributes;
+ int32 new_tup_off = 0,
+ old_tup_off = 0,
+ temp_off = 0,
+ match_off = 0,
+ change_off = 0;
+ int attnum;
+ int32 data_len,
+ old_tup_pad_len,
+ new_tup_pad_len;
+ Size old_tup_attr_len,
+ new_tup_attr_len;
+ bool is_attr_equals = true;
+ unsigned char *bp = (unsigned char *) encdata + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ char *dp = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ char *dstart = dp;
+ char *history;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ int32 len,
+ old_tup_bitmaplen,
+ new_tup_bitmaplen,
+ old_tup_len,
+ new_tup_len;
+ int32 result_size;
+ int32 result_max;
+
+ old_tup_len = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Tuples longer than PGLZ_HISTORY_SIZE are not allowed for delta
+ * encoding, as that is the maximum history offset.
+ */
+ if (old_tup_len >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ history = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ old_tup_bitmaplen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_bitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ new_tup_len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * If the lengths of the old and new tuple versions differ by more than
+ * 50%, include the new tuple as-is
+ */
+ if ((new_tup_len <= (old_tup_len >> 1))
+ || (old_tup_len <= (new_tup_len >> 1)))
+ return false;
+
+ /* Required compression ratio for EWT */
+ result_max = (new_tup_len * (100 - wal_update_compression_ratio)) / 100;
+ encdata->rawsize = new_tup_len;
+
+ /*
+ * Estimate the output length of the upcoming New Data operation as
+ * [1 control byte + 1 length byte + data_length] and validate it
+ * against result_max. The same length approximation is used for New
+ * Data throughout this function.
+ */
+ if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)
+ return false;
+
+ /* Copy the bitmap data from new tuple to EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_bitmaplen, dp);
+ dstart = dp;
+
+ /*
+ * Loop through all attributes. If an attribute is modified by the update
+ * operation, store an [Offset,Length] History Reference into the old tuple
+ * version covering the data up to the last unchanged column; otherwise
+ * store the [Length,Data] from the new tuple version as New Data.
+ */
+ numberOfAttributes = HeapTupleHeaderGetNatts(newtup->t_data);
+ for (attnum = 1; attnum <= numberOfAttributes; attnum++)
+ {
+ if (!heap_attr_get_length_and_check_equals(tupleDesc, attnum, oldtup,
+ newtup, &old_tup_attr_len, &new_tup_attr_len))
+ {
+ is_attr_equals = false;
+ data_len = old_tup_off - match_off;
+
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ /*
+ * The match_off value is calculated w.r.t. the tuple t_hoff
+ * value; the bitmap length needs to be added to match_off to get
+ * the actual start offset in the old/history tuple.
+ */
+ match_off += old_tup_bitmaplen;
+
+ /*
+ * If any unchanged data is present in the old and new tuples, encode
+ * it as a history reference, i.e. a length and an offset into the
+ * history (old) tuple.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Recalculate the old and new tuple offsets based on padding in
+ * the tuples
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+ }
+
+ if (!HeapTupleHasNulls(newtup)
+ || !att_isnull((attnum - 1), newtup->t_data->t_bits))
+ {
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ match_off = old_tup_off;
+ }
+ else
+ {
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ /* Add the modified column data to the EWT */
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * Calculate the alignment padding of this attribute in the old and
+ * new tuple versions. If the padding is the same, continue with the
+ * next attribute; otherwise 1. store an [Offset,Length] History
+ * Reference into the old tuple version for the preceding attributes
+ * (if the previous attribute is the same in both versions), and
+ * 2. add the [Length,Data] for the alignment bytes from the new
+ * tuple as New Data in the EWT.
+ */
+ if (!HeapTupleHasNulls(oldtup)
+ || !att_isnull((attnum - 1), oldtup->t_data->t_bits))
+ {
+ temp_off = old_tup_off;
+ old_tup_off = att_align_pointer(old_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) oldtup->t_data + oldtup->t_data->t_hoff + old_tup_off);
+
+ old_tup_pad_len = old_tup_off - temp_off;
+
+
+ temp_off = new_tup_off;
+ new_tup_off = att_align_pointer(new_tup_off,
+ att[attnum - 1]->attalign,
+ att[attnum - 1]->attlen,
+ (char *) newtup->t_data + newtup->t_data->t_hoff + new_tup_off);
+ new_tup_pad_len = new_tup_off - temp_off;
+
+ if (old_tup_pad_len != new_tup_pad_len)
+ {
+ /*
+ * If an alignment difference is found between the old and
+ * new tuples and the previous attribute value is the same
+ * in both versions, store the data up to the current match
+ * as a History Reference tag in the EWT.
+ */
+ if (is_attr_equals)
+ {
+ data_len = old_tup_off - old_tup_pad_len - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+ }
+
+ match_off = old_tup_off;
+
+ /* Alignment data */
+ if ((bp + (2 + new_tup_pad_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, new_tup_pad_len, dp);
+ }
+ }
+
+ old_tup_off += old_tup_attr_len;
+ new_tup_off += new_tup_attr_len;
+
+ change_off = new_tup_off;
+
+ /*
+ * Recalculate the destination pointer with the new offset which
+ * is used while copying the modified data.
+ */
+ dp = dstart + new_tup_off;
+ is_attr_equals = true;
+ }
+ }
+
+ /* If any modified column data remains, add it to the EWT. */
+ data_len = new_tup_off - change_off;
+ if ((bp + (2 + data_len)) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, data_len, dp);
+
+ /*
+ * If any remaining old tuple data is present, copy it as a history
+ * reference
+ */
+ data_len = old_tup_off - match_off;
+ len = PGLZ_GET_HIST_CTRL_BIT_LEN(data_len);
+ if ((bp + len) - bstart >= result_max)
+ return false;
+
+ match_off += old_tup_bitmaplen;
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, data_len, match_off, history);
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+
+ result_size = bp - bstart;
+ if (result_size >= result_max)
+ return false;
+
+ /* Fill in the actual length of the compressed datum */
+ SET_VARSIZE_COMPRESSED(encdata, result_size + sizeof(PGLZ_Header));
+ return true;
+ }
+
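+ /*
+ * Illustrative example of the encoding above: for a table (a int4, b text)
+ * where an UPDATE changes only b, the EWT framed by heap_delta_encode would
+ * typically contain
+ * 1. the new tuple's null bitmap/padding copied as New Data,
+ * 2. a History Reference [Offset,Length] covering column a (plus any
+ *    alignment bytes) taken from the old tuple version, and
+ * 3. a New Data item [Length,Data] carrying the new value of b.
+ * heap_delta_decode below replays these items against the old tuple version
+ * to reconstruct the new tuple during recovery.
+ */
+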
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version
+ *
+ * Encoded WAL Tuple Format:
+ * Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ *
+ *
+ * Decode Mechanism:
+ * Skip the header, read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ * Check each control bit, if the bit is set then it is History Reference which
+ * means the next 2 - 3 byte tag provides the offset and length of history match.
+ * Use the offset and corresponding length to copy data from old tuple version
+ * to new tuple.
+ * If the control bit is unset, then it is New Data Reference which means
+ * first byte contains the length [0-255] of the modified data, followed
+ * by the modified data of corresponding length specified in the first byte.
+ *
+ * Tag in History Reference:
+ * 2-3 byte tag -
+ * 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ * 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ * or equal to 18.
+ * The maximum length that can be represented by one Tag is 273.
+ *
+ * For more details about Encoded WAL Tuple (EWT) representation, refer to transam/README
+ *
+ * ----------------
+ */
+ void
+ heap_delta_decode(PGLZ_Header *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 950,955 **** heapgettup_pagemode(HeapScanDesc scan,
--- 950,1003 ----
* definition in access/htup.h is maintained.
*/
Datum
+ fastgetattr_with_len(HeapTuple tup, int attnum, TupleDesc tupleDesc,
+ bool *isnull, int32 *len)
+ {
+ return (
+ (attnum) > 0 ?
+ (
+ (*(isnull) = false),
+ HeapTupleNoNulls(tup) ?
+ (
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff >= 0 ?
+ (
+ (*(len) = att_getlength((tupleDesc)->attrs[(attnum - 1)]->attlen,
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)),
+ fetchatt((tupleDesc)->attrs[(attnum) - 1],
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +
+ (tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len)))
+ )
+ :
+ (
+ att_isnull((attnum) - 1, (tup)->t_data->t_bits) ?
+ (
+ (*(isnull) = true),
+ (*(len) = 0),
+ (Datum) NULL
+ )
+ :
+ (
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))
+ )
+ )
+ )
+ :
+ (
+ (Datum) NULL
+ )
+ );
+ }
+
+ /*
+ * This is formatted so oddly so that the correspondence to the macro
+ * definition in access/htup.h is maintained.
+ */
+ Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull)
{
***************
*** 966,972 **** fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! nocachegetattr((tup), (attnum), (tupleDesc))
)
:
(
--- 1014,1021 ----
(tupleDesc)->attrs[(attnum) - 1]->attcacheoff)
)
:
! (
! nocachegetattr((tup), (attnum), (tupleDesc)))
)
:
(
***************
*** 3609,3682 **** static bool
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Datum value1,
! value2;
! bool isnull1,
! isnull2;
! Form_pg_attribute att;
!
! /*
! * If it's a whole-tuple reference, say "not equal". It's not really
! * worth supporting this case, since it could only succeed after a no-op
! * update, which is hardly a case worth optimizing for.
! */
! if (attrnum == 0)
! return false;
!
! /*
! * Likewise, automatically say "not equal" for any system attribute other
! * than OID and tableOID; we cannot expect these to be consistent in a HOT
! * chain, or even to be set correctly yet in the new tuple.
! */
! if (attrnum < 0)
! {
! if (attrnum != ObjectIdAttributeNumber &&
! attrnum != TableOidAttributeNumber)
! return false;
! }
!
! /*
! * Extract the corresponding values. XXX this is pretty inefficient if
! * there are many indexed columns. Should HeapSatisfiesHOTandKeyUpdate do a
! * single heap_deform_tuple call on each tuple, instead? But that doesn't
! * work for system columns ...
! */
! value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
! value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
! /*
! * If one value is NULL and other is not, then they are certainly not
! * equal
! */
! if (isnull1 != isnull2)
! return false;
!
! /*
! * If both are NULL, they can be considered equal.
! */
! if (isnull1)
! return true;
!
! /*
! * We do simple binary comparison of the two datums. This may be overly
! * strict because there can be multiple binary representations for the
! * same logical value. But we should be OK as long as there are no false
! * positives. Using a type-specific equality operator is messy because
! * there could be multiple notions of equality in different operator
! * classes; furthermore, we cannot safely invoke user-defined functions
! * while holding exclusive buffer lock.
! */
! if (attrnum <= 0)
! {
! /* The only allowed system columns are OIDs, so do this */
! return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
! }
! else
! {
! Assert(attrnum <= tupdesc->natts);
! att = tupdesc->attrs[attrnum - 1];
! return datumIsEqual(value1, value2, att->attbyval, att->attlen);
! }
}
/*
--- 3658,3668 ----
heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
HeapTuple tup1, HeapTuple tup2)
{
! Size tup1_attr_len,
! tup2_attr_len;
! return heap_attr_get_length_and_check_equals(tupdesc, attrnum, tup1, tup2,
! &tup1_attr_len, &tup2_attr_len);
}
/*
***************
*** 5765,5770 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5751,5766 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5774,5788 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5770,5815 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * An EWT can be generated for any new tuple version created by an UPDATE
+ * operation. Currently we do it only when the old and new tuple versions
+ * are on the same page, so that if the page containing the old tuple is
+ * corrupt during recovery, the corruption cannot cascade to other pages.
+ * Under the general assumption that over long runs most updates create
+ * the new tuple version on the same page, this should not significantly
+ * reduce the WAL savings or the performance benefit.
+ *
+ * We should not generate an EWT when we need to back up the whole block
+ * in WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5809,5817 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5836,5847 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows,
! * OR PG93FORMAT (if encoded): LZ header + encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6614,6620 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6644,6653 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6629,6635 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6662,6668 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6689,6695 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6722,6728 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6707,6713 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6740,6746 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6732,6738 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6765,6771 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6795,6804 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6828,6859 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode(encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6814,6820 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6869,6875 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,778 ----
entry until we've successfully done the original action.
+ Encoded WAL Tuple (EWT)
+ -----------------------
+
+ A delta Encoded WAL Tuple (EWT) eliminates the need to copy the entire tuple to WAL for an update operation.
+ The EWT is constructed by comparing the old and new tuple versions w.r.t. column boundaries. It contains the data
+ from the new tuple for modified columns and [Offset,Length] references into the old tuple version for unchanged columns.
+
+
+ EWT Format
+ ----------
+
+ Header + Control byte + History Reference (2 - 3)bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is the same as PGLZ_Header, which is used to store the compressed length and the raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this the control byte.
+
+
+ History Reference:
+
+ A set bit in the control byte means that a tag of 2-3 bytes follows. A tag contains the information
+ needed to copy some bytes from the old tuple version to the current location in the output.
+
+ Details about 2-3 byte Tag
+ 2 byte tag is used when length of History data (unchanged data from old tuple version) is less than 18.
+ 3 byte tag is used when length of History data (unchanged data from old tuple version) is greater than
+ or equal to 18.
+ The maximum length that can be represented by one Tag is 273.
+
+ Let's call the three tag bytes T1, T2 and T3. The position of the data to copy is coded as an offset
+ from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ This limits the offset to 1-4095 (12 bits) and the length to 3-18 (4 bits) because 3 is always added to it.
+
+ In the actual implementation, the 2 byte tag's length is limited to 3-17, because the value 0xF
+ in the length nibble has special meaning. It means, that the next following byte (T3) has to be
+ added to the length value of 18. That makes total limits of 1-4095 for offset and 3-273 for length.
+
+
+
+
+ New data:
+
+ An unset bit in the control byte represents modified data from the new tuple version.
+ The first byte represents the length [0-255] of the modified data, followed by the
+ modified data of the corresponding length.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+ This encoding is very similar to LZ Compression used in PostgreSQL (pg_lzcompress.c).
+
+
+ Encoding Mechanism for EWT
+ --------------------------
+ Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple) and loop for all attributes
+ to find any modifications in the attributes. The unmodified data is encoded as a
+ History Reference in EWT and the modified data (if NOT NULL) is encoded as New Data in EWT.
+
+ The offset values are calculated with respect to the tuple t_hoff value. For each column attribute
+ old and new tuple offsets are recalculated based on padding in the tuples.
+ Once an alignment difference is found between the old and new tuple versions,
+ the alignment difference is included as New Data in the EWT.
+
+ The maximum encoded data length is 75% (default compression rate) of the original data; if the encoded output
+ is longer than that, the original tuple (new tuple version) is stored directly in the WAL record.
+
+
+ Decoding Mechanism for EWT
+ --------------------------
+ Skip the header, read one control byte and process the next 8 items (or as many as remain in the compressed input).
+ Check each control bit: if the bit is set, it is a History Reference, which means the next 2-3 byte tag
+ provides the offset and length of the history match.
+ Use the offset and corresponding length to copy data from the old tuple version to the new tuple.
+ If the control bit is unset, it is a New Data reference, which means the first byte contains the
+ length [0-255] of the modified data, followed by the modified data of the length
+ specified in the first byte.
+
+
+ Constraints for EWT
+ --------------------
+ 1. Delta encoding is attempted only when the new tuple version goes to the same page and
+ the buffer does not need a backup block (full-page write).
+ 2. Only old tuples shorter than PGLZ_HISTORY_SIZE are allowed for encoding.
+ 3. The old and new tuple versions must not differ in length by more than 50%.
+
+
Asynchronous Commit
-------------------
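
To complement the decoding description in the README section above, here is a simplified standalone sketch (not part of the patch) of the loop that pglz_decompress_with_history implements for an EWT; it assumes the PGLZ_Header has already been consumed and omits the destination overrun checks the real code performs:

#include <string.h>

static void
decode_ewt(const unsigned char *sp, const unsigned char *srcend,
           unsigned char *dp, const unsigned char *history)
{
	while (sp < srcend)
	{
		unsigned char ctrl = *sp++;
		int			ctrlc;

		for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
		{
			if (ctrl & 1)
			{
				/* History Reference: 2-3 byte tag, copy from the old tuple */
				int			len = (sp[0] & 0x0f) + 3;
				int			off = ((sp[0] & 0xf0) << 4) | sp[1];

				sp += 2;
				if (len == 18)		/* 3-byte tag: extra length byte */
					len += *sp++;
				memcpy(dp, history + off, len);
				dp += len;
			}
			else
			{
				/* New Data: 1 length byte followed by that many data bytes */
				int			len = *sp++;

				memcpy(dp, sp, len);
				dp += len;
				sp += len;
			}
			ctrl >>= 1;
		}
	}
}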
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1204,1209 **** begin:;
--- 1204,1231 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 182,190 ****
*/
#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
- #define PGLZ_HISTORY_SIZE 4096
- #define PGLZ_MAX_MATCH 273
-
/* ----------
* PGLZ_HistEntry -
--- 182,187 ----
***************
*** 302,368 **** do { \
} \
} while (0)
-
- /* ----------
- * pglz_out_ctrl -
- *
- * Outputs the last and allocates a new control byte if needed.
- * ----------
- */
- #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
- do { \
- if ((__ctrl & 0xff) == 0) \
- { \
- *(__ctrlp) = __ctrlb; \
- __ctrlp = (__buf)++; \
- __ctrlb = 0; \
- __ctrl = 1; \
- } \
- } while (0)
-
-
- /* ----------
- * pglz_out_literal -
- *
- * Outputs a literal byte to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- *(_buf)++ = (unsigned char)(_byte); \
- _ctrl <<= 1; \
- } while (0)
-
-
- /* ----------
- * pglz_out_tag -
- *
- * Outputs a backward reference tag of 2-4 bytes (depending on
- * offset and length) to the destination buffer including the
- * appropriate control bit.
- * ----------
- */
- #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off) \
- do { \
- pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
- _ctrlb |= _ctrl; \
- _ctrl <<= 1; \
- if (_len > 17) \
- { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
- (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
- (_buf)[2] = (unsigned char)((_len) - 18); \
- (_buf) += 3; \
- } else { \
- (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
- (_buf)[1] = (unsigned char)((_off) & 0xff); \
- (_buf) += 2; \
- } \
- } while (0)
-
-
/* ----------
* pglz_find_match -
*
--- 299,304 ----
***************
*** 595,601 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
--- 531,537 ----
* Create the tag and add history entries for all matched
* characters.
*/
! pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off, dp);
while (match_len--)
{
pglz_hist_add(hist_start, hist_entries,
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 583,620 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
}
/*
--- 658,726 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! /*
! * The byte at current offset in the source is the length
! * of this literal segment. See pglz_out_add for encoding
! * side.
! */
! int32 len;
!
! len = sp[0];
! sp += 1;
!
! if (dp + len > destend)
! {
! dp += len;
! break;
! }
!
! /*
! * Now we copy the bytes specified by the tag from Source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! dp += len;
! sp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
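
Since wal_update_compression_ratio (added above) drives the size budget that heap_delta_encode must stay within, a tiny standalone sketch of the arithmetic (illustrative numbers only, not part of the patch):

#include <stdio.h>

int
main(void)
{
	int			new_tup_len = 200;	/* new tuple data length, excluding header */
	int			ratio = 25;			/* wal_update_compression_ratio (default) */
	int			result_max = (new_tup_len * (100 - ratio)) / 100;

	/*
	 * heap_delta_encode gives up and falls back to logging the whole new
	 * tuple as soon as the encoded output would reach this budget.
	 */
	printf("EWT must stay below %d of %d bytes\n", result_max, new_tup_len);	/* 150 of 200 */
	return 0;
}

With the default ratio of 25, the encoded WAL tuple may use at most 75% of the new tuple's data length; otherwise the whole new tuple is logged as before.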
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
! * update operation is
! * delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 18,23 ****
--- 18,24 ----
#include "access/tupdesc.h"
#include "access/tupmacs.h"
#include "storage/bufpage.h"
+ #include "utils/pg_lzcompress.h"
/*
* MaxTupleAttributeNumber limits the number of (user) columns in a tuple.
***************
*** 579,584 **** struct MinimalTupleData
--- 580,586 ----
HeapTupleHeaderSetOid((tuple)->t_data, (oid))
+ #if !defined(DISABLE_COMPLEX_MACRO)
/* ----------------
* fastgetattr
*
***************
*** 593,601 **** struct MinimalTupleData
* lookups, and call nocachegetattr() for the rest.
* ----------------
*/
-
- #if !defined(DISABLE_COMPLEX_MACRO)
-
#define fastgetattr(tup, attnum, tupleDesc, isnull) \
( \
AssertMacro((attnum) > 0), \
--- 595,600 ----
***************
*** 623,636 **** struct MinimalTupleData
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
)
- #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
-
/* ----------------
* heap_getattr
*
--- 622,677 ----
nocachegetattr((tup), (attnum), (tupleDesc)) \
) \
) \
+ ) \
+
+ /* ----------------
+ * fastgetattr_with_len
+ *
+ * Similar to fastgetattr and fetches the length of the given attribute
+ * also.
+ * ----------------
+ */
+ #define fastgetattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ AssertMacro((attnum) > 0), \
+ (*(isnull) = false), \
+ HeapTupleNoNulls(tup) ? \
+ ( \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff >= 0 ? \
+ ( \
+ (*(len) = att_getlength( \
+ (tupleDesc)->attrs[(attnum)-1]->attlen, \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff +\
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff)), \
+ fetchatt((tupleDesc)->attrs[(attnum)-1], \
+ (char *) (tup)->t_data + (tup)->t_data->t_hoff + \
+ (tupleDesc)->attrs[(attnum)-1]->attcacheoff) \
+ ) \
+ : \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ : \
+ ( \
+ att_isnull((attnum)-1, (tup)->t_data->t_bits) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ ( \
+ nocachegetattr_with_len((tup), (attnum), (tupleDesc), (len))\
+ ) \
+ ) \
)
+ #else /* defined(DISABLE_COMPLEX_MACRO) */
extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
bool *isnull);
+ extern Datum fastgetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc tupleDesc, bool *isnull, int32 *len);
#endif /* defined(DISABLE_COMPLEX_MACRO) */
/* ----------------
* heap_getattr
*
***************
*** 647,667 **** extern Datum fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
( \
! ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
! ( \
! (*(isnull) = true), \
! (Datum)NULL \
! ) \
! : \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
) \
: \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
--- 688,730 ----
* ----------------
*/
#define heap_getattr(tup, attnum, tupleDesc, isnull) \
+ ( \
+ ((attnum) > 0) ? \
( \
! ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
( \
! (*(isnull) = true), \
! (Datum)NULL \
) \
: \
! fastgetattr((tup), (attnum), (tupleDesc), (isnull)) \
! ) \
! : \
! heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
! )
+ /* ----------------
+ * heap_getattr_with_len
+ *
+ * Similar to heap_getattr and outputs the length of the given attribute.
+ * ----------------
+ */
+ #define heap_getattr_with_len(tup, attnum, tupleDesc, isnull, len) \
+ ( \
+ ((attnum) > 0) ? \
+ ( \
+ ((attnum) > (int) HeapTupleHeaderGetNatts((tup)->t_data)) ? \
+ ( \
+ (*(isnull) = true), \
+ (*(len) = 0), \
+ (Datum)NULL \
+ ) \
+ : \
+ fastgetattr_with_len((tup), (attnum), (tupleDesc), (isnull), (len)) \
+ ) \
+ : \
+ heap_getsysattr((tup), (attnum), (tupleDesc), (isnull)) \
+ )
/* prototypes for functions in common/heaptuple.c */
extern Size heap_compute_data_size(TupleDesc tupleDesc,
***************
*** 671,676 **** extern void heap_fill_tuple(TupleDesc tupleDesc,
--- 734,741 ----
char *data, Size data_size,
uint16 *infomask, bits8 *bit);
extern bool heap_attisnull(HeapTuple tup, int attnum);
+ extern Datum nocachegetattr_with_len(HeapTuple tup, int attnum,
+ TupleDesc att, Size *len);
extern Datum nocachegetattr(HeapTuple tup, int attnum,
TupleDesc att);
extern Datum heap_getsysattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 752,765 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_attr_get_length_and_check_equals(TupleDesc tupdesc,
+ int attrnum, HeapTuple tup1, HeapTuple tup2,
+ Size *tup1_attr_len, Size *tup2_attr_len);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, PGLZ_Header *encdata);
+ extern void heap_delta_decode (PGLZ_Header *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/tupmacs.h
--- b/src/include/access/tupmacs.h
***************
*** 187,192 ****
--- 187,214 ----
)
/*
+ * att_getlength -
+ * Gets the length of the attribute.
+ */
+ #define att_getlength(attlen, attptr) \
+ ( \
+ ((attlen) > 0) ? \
+ ( \
+ (attlen) \
+ ) \
+ : (((attlen) == -1) ? \
+ ( \
+ VARSIZE_ANY(attptr) \
+ ) \
+ : \
+ ( \
+ AssertMacro((attlen) == -2), \
+ (strlen((char *) (attptr)) + 1) \
+ )) \
+ )
+
+
+ /*
* store_att_byval is a partial inverse of fetch_att: store a given Datum
* value into a tuple data area at the specified address. However, it only
* handles the byval case, because in typical usage the caller needs to
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 23,28 **** typedef struct PGLZ_Header
--- 23,31 ----
int32 rawsize;
} PGLZ_Header;
+ /* The LZ algorithm can only represent history offsets in the range 1 - 4095. */
+ #define PGLZ_HISTORY_SIZE 4096
+ #define PGLZ_MAX_MATCH 273
/* ----------
* PGLZ_MAX_OUTPUT -
***************
*** 86,91 **** typedef struct PGLZ_Strategy
--- 89,207 ----
int32 match_size_drop;
} PGLZ_Strategy;
+ /*
+ * Calculate the approximate output length required for history reference tags
+ * covering the given length.
+ */
+ #define PGLZ_GET_HIST_CTRL_BIT_LEN(_len) \
+ ( \
+ ((_len) < 17) ? (3) : (4 * (1 + ((_len) / PGLZ_MAX_MATCH))) \
+ )
+
+ /* ----------
+ * pglz_out_ctrl -
+ *
+ * Outputs the last and allocates a new control byte if needed.
+ * ----------
+ */
+ #define pglz_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
+ do { \
+ if ((__ctrl & 0xff) == 0) \
+ { \
+ *(__ctrlp) = __ctrlb; \
+ __ctrlp = (__buf)++; \
+ __ctrlb = 0; \
+ __ctrl = 1; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_literal -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+ do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 1; \
+ } while (0)
+
+ /* ----------
+ * pglz_out_tag -
+ *
+ * Outputs a backward/history reference tag of 2-3 bytes (depending on
+ * offset and length) to the destination buffer including the
+ * appropriate control bit.
+ *
+ * The backward/history reference is emitted as multiple chunks if the
+ * given length exceeds the maximum match length, repeating until the
+ * whole length has been processed.
+ *
+ * If the matched history length is less than 3 bytes, it is added as
+ * New Data during encoding instead of a history reference. This occurs
+ * only while framing an EWT.
+ * ----------
+ */
+ #define pglz_out_tag(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_byte) \
+ do { \
+ int _mtaglen; \
+ int _tagtotal_len = (_len); \
+ while (_tagtotal_len > 0) \
+ { \
+ _mtaglen = _tagtotal_len > PGLZ_MAX_MATCH ? PGLZ_MAX_MATCH : _tagtotal_len; \
+ if (_mtaglen < 3) \
+ { \
+ char *_data = (char *)(_byte) + (_off); \
+ pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_mtaglen,_data); \
+ break; \
+ } \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_mtaglen > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_mtaglen) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_mtaglen) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+ _tagtotal_len -= _mtaglen; \
+ (_off) += _mtaglen; \
+ } \
+ } while (0)
+
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
+
/* ----------
* The standard strategies
***************
*** 108,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
!
#endif /* _PG_LZCOMPRESS_H_ */
--- 224,229 ----
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
! extern void pglz_decompress_with_history(const char *source, char *dest,
! uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuos and non continuos columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuos and non continuos columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On 28.01.2013 15:39, Amit Kapila wrote:
Rebased the patch as per HEAD.
I don't like the way heap_delta_encode has intimate knowledge of how the
lz compression works. It feels like a violent punch through the
abstraction layers.
Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need to
completely blow up the abstraction. pglz can still do the encoding - the
caller just needs to pass it the attribute boundaries to consider for
matches, so that it doesn't need to scan them byte by byte.
I came up with the attached patch. I wrote it to demonstrate the API; I'm not 100% sure the result after decoding is correct.
- Heikki
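To make the decoding side concrete, here is a minimal standalone sketch (mine, for illustration only; not code from either patch) of what reconstructing the new tuple from an EWT amounts to during redo: the old tuple on the page acts as the history buffer, and the encoded record is a stream of "copy [offset,length] from history" and "literal new data" operations. The Op struct and the offsets-from-start-of-history convention are assumptions for readability; the real format packs these operations into control bits and 1-3 byte tags.

/*
 * Toy EWT decoder: rebuild the new tuple from the old tuple (history)
 * plus a list of copy/literal operations.  Illustrative only.
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
    int         is_copy;    /* 1 = copy from history, 0 = literal bytes */
    int         off;        /* copy: offset into the history buffer */
    int         len;        /* number of bytes to copy or emit */
    const char *data;       /* literal: the new bytes */
} Op;

static int
toy_delta_decode(const char *history, const Op *ops, int nops, char *dest)
{
    int         dlen = 0;
    int         i;

    for (i = 0; i < nops; i++)
    {
        if (ops[i].is_copy)
            memcpy(dest + dlen, history + ops[i].off, ops[i].len);
        else
            memcpy(dest + dlen, ops[i].data, ops[i].len);
        dlen += ops[i].len;
    }
    return dlen;
}

int
main(void)
{
    /* old tuple = three 4-byte columns; only the middle column changed */
    const char  oldtup[] = "AAAABBBBCCCC";
    Op          ops[] = {
        {1, 0, 4, NULL},        /* first column: copy from history */
        {0, 0, 4, "XXXX"},      /* second column: literal new data */
        {1, 8, 4, NULL},        /* third column: copy from history */
    };
    char        newtup[16];
    int         len = toy_delta_decode(oldtup, ops, 3, newtup);

    printf("%.*s\n", len, newtup);      /* prints AAAAXXXXCCCC */
    return 0;
}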
Attachments:
wal_update_pglz_with_history-heikki.patch (text/x-diff)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..bbdee4f 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,119 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata)
+{
+ HeapTupleHeader tup = oldtup->t_data;
+ Form_pg_attribute *att = tupleDesc->attrs;
+ bool hasnulls = HeapTupleHasNulls(oldtup);
+ bits8 *bp = oldtup->t_data->t_bits; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ char *tp; /* ptr to tuple data */
+ long off; /* offset in tuple data */
+ int natts;
+ int32 *offsets;
+ int noffsets;
+ int attnum;
+ PGLZ_Strategy strategy;
+
+ /*
+ * Loop through all attributes; if an attribute is modified by the update
+ * operation, store the [Offset,Length] referring to the old tuple version,
+ * up to the last unchanged column, in the EWT as a History Reference;
+ * otherwise store the [Length,Data] from the new tuple version as New Data.
+ */
+ natts = HeapTupleHeaderGetNatts(oldtup->t_data);
+
+ offsets = palloc(natts * sizeof(int32));
+
+ noffsets = 0;
+
+ /* copied from heap_deform_tuple */
+ tp = (char *) tup + tup->t_hoff;
+ off = 0;
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ Form_pg_attribute thisatt = att[attnum];
+
+ if (hasnulls && att_isnull(attnum, bp))
+ {
+ slow = true; /* can't use attcacheoff anymore */
+ continue;
+ }
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+
+ offsets[noffsets++] = off;
+ }
+
+ strategy = *PGLZ_strategy_always;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_compress_with_history((char *) oldtup->t_data, oldtup->t_len,
+ (char *) newtup->t_data, newtup->t_len,
+ offsets, noffsets, (PGLZ_Header *) encdata,
+ &strategy);
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_decompress_with_history((char *) encdata,
+ newtup->t_data,
+ &newtup->t_len,
+ (char *) oldtup->t_data,
+ oldtup->t_len);
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57d47e8..789bbe2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,6 +70,7 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
@@ -5765,6 +5766,16 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5774,15 +5785,46 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, (char *) &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5809,9 +5851,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6614,7 +6659,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6629,7 +6677,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6689,7 +6737,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6707,7 +6755,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6732,7 +6780,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6795,10 +6843,32 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
+
+ oldtup.t_data = oldtupdata;
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) encoded_data, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6814,7 +6884,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cf2f6e7..9cd6271 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1204,6 +1204,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..c6ba6af 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -373,6 +373,7 @@ do { \
*/
static inline int
pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
+ const char *historyend,
int *lenp, int *offp, int good_match, int good_drop)
{
PGLZ_HistEntry *hent;
@@ -393,7 +394,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ thisoff = (historyend ? historyend : ip) - hp;
if (thisoff >= 0x0fff)
break;
@@ -408,12 +409,12 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = 0;
if (len >= 16)
{
- if (memcmp(ip, hp, len) == 0)
+ if ((historyend == NULL || historyend - hp > len) && memcmp(ip, hp, len) == 0)
{
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH && (historyend == NULL || hp < historyend))
{
thislen++;
ip++;
@@ -423,7 +424,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH && (historyend == NULL || hp < historyend))
{
thislen++;
ip++;
@@ -588,7 +589,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
- if (pglz_find_match(hist_start, dp, dend, &match_len,
+ if (pglz_find_match(hist_start, dp, dend, NULL, &match_len,
&match_off, good_match, good_drop))
{
/*
@@ -637,6 +638,176 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Like pglz_compress, but performs delta encoding rather than compression.
+ * The back references are offsets from the end of history data, rather
+ * than current output position. 'hoffsets' is an array of offsets in the
+ * history to consider. We could scan the whole history string for possible
+ * matches, but if the caller has some information on which offsets are
+ * likely to be interesting (attribute boundaries, when encoding tuples, for
+ * example), this is a lot faster.
+ */
+bool
+pglz_compress_with_history(const char *source, int32 slen, const char *history,
+ int32 hlen,
+ int32 *hoffsets,
+ int32 nhoffsets,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ int hist_next = 0;
+ bool hist_recycle = false;
+ const char *dp = source;
+ const char *dend = source + slen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len;
+ int32 match_off;
+ int32 good_match;
+ int32 good_drop;
+ int32 result_size;
+ int32 result_max;
+ int i;
+ int32 need_rate;
+ const char *historyend = history + hlen;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ /*
+ * Save the original source size in the header.
+ */
+ dest->rawsize = slen;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, sizeof(hist_start));
+
+ /* Populate the history hash from the history string */
+ for (i = 0; i < nhoffsets; i++)
+ {
+ const char *hp = history + hoffsets[i];
+
+ /* Add this offset to history */
+ pglz_hist_add(hist_start, hist_entries,
+ hist_next, hist_recycle,
+ hp, historyend);
+ }
+
+ /*
+ * Compress the source directly into the output buffer.
+ */
+ dp = source;
+ while (dp < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ if (pglz_find_match(hist_start, dp, dend, historyend, &match_len,
+ &match_off, good_match, good_drop))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header));
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -647,15 +818,39 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL, 0);
+}
+
+/* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history, int hlen)
+{
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ const char *historyend = history + hlen;
+
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
- srcend = ((const unsigned char *) source) + VARSIZE(source);
+ srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
- destend = dp + source->rawsize;
+ destend = dp + src.rawsize;
+
+ if (destlen)
+ {
+ *destlen = src.rawsize;
+ }
while (sp < srcend && dp < destend)
{
@@ -699,26 +894,38 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
- /*
- * Now we copy the bytes specified by the tag from OUTPUT to
- * OUTPUT. It is dangerous and platform dependent to use
- * memcpy() here, because the copied areas could overlap
- * extremely!
- */
- while (len--)
+ if (history)
+ {
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, historyend - off, len);
+ dp += len;
+ }
+ else
{
- *dp = dp[-off];
- dp++;
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT
+ * to OUTPUT. It is dangerous and platform dependent to
+ * use memcpy() here, because the copied areas could
+ * overlap extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
}
}
else
{
/*
- * An unset control bit means LITERAL BYTE. So we just copy
- * one from INPUT to OUTPUT.
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
*/
- if (dp >= destend) /* check for buffer overrun */
- break; /* do not clobber memory */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
*dp++ = *sp++;
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6128694..9a37b2d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2382,6 +2383,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..1825292 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..042c8b9 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata);
+extern void heap_delta_decode (char *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 72e3242..15f5d5d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..7a32803 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,8 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_compress_with_history(const char *source, int32 slen, const char *history, int32 hlen, int32 *hoffsets, int32 noffsets, PGLZ_Header *dest, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen, const char *history, int hlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuos and non continuos columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Tuesday, January 29, 2013 2:53 AM Heikki Linnakangas wrote:
On 28.01.2013 15:39, Amit Kapila wrote:
Rebased the patch as per HEAD.
I don't like the way heap_delta_encode has intimate knowledge of how
the lz compression works. It feels like a violent punch through the
abstraction layers.
Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need
to completely blow up the abstraction. pglz can still do the encoding -
the caller just needs to pass it the attribute boundaries to consider
for matches, so that it doesn't need to scan them byte by byte.
I came up with the attached patch. I wrote it to demonstrate the API; I'm not 100% sure the result after decoding is correct.
I have checked the patch code and found a few problems.
1. The history should be the old tuple; for that, the call below needs to be changed:
	/*
	return pglz_compress_with_history((char *) oldtup->t_data, oldtup->t_len,
									  (char *) newtup->t_data, newtup->t_len,
									  offsets, noffsets, (PGLZ_Header *) encdata,
									  &strategy);
	*/
	return pglz_compress_with_history((char *) newtup->t_data, newtup->t_len,
									  (char *) oldtup->t_data, oldtup->t_len,
									  offsets, noffsets, (PGLZ_Header *) encdata,
									  &strategy);
2. The offsets array should contain the starting offset of each column. For that, the code below needs to be changed:
		offsets[noffsets++] = off;

		off = att_addlength_pointer(off, thisatt->attlen, tp + off);

		if (thisatt->attlen <= 0)
			slow = true;		/* can't use attcacheoff anymore */

		/* offsets[noffsets++] = off; */
	}
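In other words (a schematic with made-up fixed-width columns, not the actual heaptuple.c loop): each column's start offset has to be recorded before the running offset is advanced past that column's data.

/*
 * Sketch of the corrected offset collection: record the start offset
 * first, then skip over the column's data.
 */
#include <stdio.h>

int
main(void)
{
    int     collens[] = {8, 1, 25, 4};  /* made-up column lengths */
    int     offsets[4];
    int     noffsets = 0;
    int     off = 0;
    int     attnum;

    for (attnum = 0; attnum < 4; attnum++)
    {
        offsets[noffsets++] = off;      /* start of this column */
        off += collens[attnum];         /* then advance past its data */
    }

    for (attnum = 0; attnum < noffsets; attnum++)
        printf("column %d starts at offset %d\n", attnum, offsets[attnum]);
    return 0;
}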
Apart from this, some of the test cases are failing, which I need to check.
I have debugged the new code, and it appears that it will not be as
efficient as the current approach of the patch.
It needs to build a hash table for the history reference and comparison, which
can add overhead compared to the existing approach. I am collecting the
performance and WAL reduction data.
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
With Regards,
Amit Kapila.
On 29.01.2013 11:58, Amit Kapila wrote:
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch behave
more like yours by also passing an array of offsets in the new tuple,
and only checking for matches at those offsets.
- Heikki
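For illustration, here is a standalone sketch (assumed names, not the patch code) of that suggestion: walk the per-column start offsets of the old and new tuples in lockstep, compare each column with memcmp, and emit a history reference for an unchanged column and literal new data for a changed one. The printed COPY/ADD lines stand in for the real control-bit and tag encoding, and alignment, NULLs and varlena columns are ignored.

/*
 * Toy column-wise delta encoder driven by caller-supplied offsets.
 * Illustrative only.
 */
#include <stdio.h>
#include <string.h>

static void
toy_delta_encode(const char *hist, const int *hoff, int hlen,
                 const char *newp, const int *noff, int nlen,
                 int ncols)
{
    int         i;

    for (i = 0; i < ncols; i++)
    {
        int     hstart = hoff[i];
        int     hend = (i + 1 < ncols) ? hoff[i + 1] : hlen;
        int     nstart = noff[i];
        int     nend = (i + 1 < ncols) ? noff[i + 1] : nlen;

        if (hend - hstart == nend - nstart &&
            memcmp(hist + hstart, newp + nstart, hend - hstart) == 0)
            printf("COPY off=%d len=%d (column %d unchanged)\n",
                   hstart, hend - hstart, i);
        else
            printf("ADD  len=%d (column %d as literal new data)\n",
                   nend - nstart, i);
    }
}

int
main(void)
{
    /* two 12-byte "tuples" with three 4-byte columns; column 1 differs */
    const char  oldtup[] = "AAAABBBBCCCC";
    const char  newtup[] = "AAAAXXXXCCCC";
    int         offs[] = {0, 4, 8};

    toy_delta_encode(oldtup, offs, 12, newtup, offs, 12, 3);
    return 0;
}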
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch behave
more like yours by also passing an array of offsets in the new tuple,
and only checking for matches at those offsets.
I think it makes sense, because if we have the offsets of both the new and old
tuples, we can internally use memcmp to compare the columns and use the same
algorithm for encoding.
I will change the patch according to this suggestion.
With Regards,
Amit Kapila.
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Could there be another way to make the current patch code better, so that we
don't need to change the encoding approach? I have a feeling that the new
approach might not be equally good performance-wise.
The point is that I don't want heap_delta_encode() to know the
internals of pglz compression. You could probably make my patch behave
more like yours by also passing an array of offsets in the new tuple,
and only checking for matches at those offsets.
I think it makes sense, because if we have the offsets of both the new and old
tuples, we can internally use memcmp to compare the columns and use the same
algorithm for encoding.
I will change the patch according to this suggestion.
I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have also passed the bitmap
length, as we need to copy the new tuple's bitmap as-is into the Encoded WAL
Tuple.
Please see whether such an API design is okay.
I shall update the README and send the performance/WAL reduction data for the
modified patch tomorrow.
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v10.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,70 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ #include "utils/pg_lzcompress.h"
+ /* guc variable for EWT compression ratio*/
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 69,74 ****
--- 73,80 ----
#define VARLENA_ATT_IS_PACKABLE(att) \
((att)->attstorage != 'p')
+ static void heap_get_attr_offsets (TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets);
/* ----------------------------------------------------------------
* misc support routines
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 623,766 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_get_attr_offsets
+ *
+ * Given a tuple, extract its column starting offsets, including NULL
+ * columns. For NULL columns the offset will be the same as the next
+ * attribute's offset.
+ * ----------------
+ */
+ static void
+ heap_get_attr_offsets (TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets)
+ {
+ HeapTupleHeader tup = Tuple->t_data;
+ Form_pg_attribute *att = tupleDesc->attrs;
+ bool hasnulls = HeapTupleHasNulls(Tuple);
+ bits8 *bp = Tuple->t_data->t_bits; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ char *tp; /* ptr to tuple data */
+ long off; /* offset in tuple data */
+ int natts;
+ int attnum;
+
+ natts = HeapTupleHeaderGetNatts(Tuple->t_data);
+
+ *offsets = palloc(natts * sizeof(int32));
+
+ *noffsets = 0;
+
+ /* copied from heap_deform_tuple */
+ tp = (char *) tup + tup->t_hoff;
+ off = 0;
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ Form_pg_attribute thisatt = att[attnum];
+
+ if (hasnulls && att_isnull(attnum, bp))
+ {
+ slow = true; /* can't use attcacheoff anymore */
+ (*offsets)[(*noffsets)++] = off;
+ continue;
+ }
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ (*offsets)[(*noffsets)++] = off;
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+
+ }
+
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata)
+ {
+ int32 *hoffsets,
+ *newoffsets;
+ int noffsets;
+ PGLZ_Strategy strategy;
+ int32 newbitmaplen,
+ hbitmpalen;
+
+ newbitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ hbitmpalen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Deform and get the old and new tuple column boundary offsets. Which are
+ * required for calculating delta between old and new tuples.
+ */
+ heap_get_attr_offsets(tupleDesc, oldtup, &hoffsets, &noffsets);
+ heap_get_attr_offsets(tupleDesc, newtup, &newoffsets, &noffsets);
+
+ strategy = *PGLZ_strategy_always;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_compress_with_history((char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ newoffsets, hoffsets, noffsets,
+ newbitmaplen, hbitmpalen,
+ (PGLZ_Header *) encdata, &strategy);
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+ void
+ heap_delta_decode(char *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 70,75 ****
--- 70,76 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/pg_lzcompress.h"
/* GUC variable */
***************
*** 5765,5770 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5766,5781 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5774,5788 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5785,5830 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, (char *) &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5809,5817 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5851,5862 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6614,6620 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6659,6668 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6629,6635 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6677,6683 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6689,6695 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6737,6743 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6707,6713 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6755,6761 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6732,6738 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6780,6786 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6795,6804 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6843,6874 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode((char *) encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6814,6820 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6884,6890 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1209,1214 **** begin:;
--- 1209,1236 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 471,476 **** pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
--- 471,516 ----
return 0;
}
+ /* ----------
+ * pglz_find_match_with_history -
+ *
+ * Lookup the history table if the actual input stream matches
+ * another sequence of characters, starting somewhere earlier
+ * in the input buffer.
+ * ----------
+ */
+ static inline int
+ pglz_find_match_with_history(const char *input, const char *end,
+ const char *history, const char *hend, int *lenp)
+ {
+ const char *ip = input;
+ const char *hp = history;
+
+ /*
+ * Determine length of match. A better match must be larger than the
+ * best so far. And if we already have a match of 16 or more bytes,
+ * it's worth the call overhead to use memcmp() to check if this match
+ * is equal for the same size. After that we must fallback to
+ * character by character comparison to know the exact position where
+ * the diff occurred.
+ */
+ while (ip < end && hp < hend && *ip == *hp && *lenp < PGLZ_MAX_MATCH)
+ {
+ (*lenp)++;
+ ip++;
+ hp++;
+ }
+
+ /*
+ * Return match information only if it results at least in one byte
+ * reduction.
+ */
+ if (*lenp > 2)
+ return 1;
+
+ return 0;
+ }
+
/* ----------
* pglz_compress -
***************
*** 637,642 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 677,879 ----
return true;
}
+ /* ----------
+ * pglz_compress_with_history
+ *
+ * Like pglz_compress, but performs delta encoding rather than compression.
+ * The references are offsets from the start of history data, rather
+ * than the current output position. 'hoffsets' and 'newoffsets' are arrays
+ * of offsets in the history and source to consider. We could scan the whole
+ * history string for possible matches, but restricting the comparison to
+ * offsets the caller knows to be interesting (attribute boundaries, when
+ * encoding tuples, for example) is a lot faster.
+ * For attributes having NULL value, the offset will be same as next attribute
+ * offset. When old tuple contains NULL and new tuple has non-NULL value,
+ * it will copy it as New Data in Encoded WAL Tuple. When new tuple has NULL
+ * value and old tuple has non-NULL value, the old tuple value will be ignored.
+ * ----------
+ */
+ bool
+ pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+ {
+ unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int i;
+ int32 need_rate;
+ const char *hp = history;
+ const char *hend = history + hlen;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ /*
+ * Save the original source size in the header.
+ */
+ dest->rawsize = slen;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ /*
+ * Copy the source directly into the output buffer up to newbitmaplen.
+ */
+ dend = source + newbitmaplen;
+ while (dp < dend)
+ {
+ if (bp - bstart >= result_max)
+ return false;
+
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through all attribute offsets; if the attribute data differs from
+ * the history at the corresponding offsets, store an [Offset,Length] tag
+ * referring to the history version up to the match, and store the changed
+ * data as New data.
+ */
+ match_off = hbitmaplen;
+ hp = history + hbitmaplen;
+ for (i = 0; i < noffsets; i++)
+ {
+ dend = source + ((i + 1 == noffsets) ? slen : newoffsets[i + 1] + newbitmaplen);
+ hend = history + ((i + 1 == noffsets) ? hlen : hoffsets[i + 1] + hbitmaplen);
+
+ MATCH_AGAIN:
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ if (pglz_find_match_with_history(dp + match_len, dend, hp + match_len,
+ hend, &match_len))
+ {
+ found_match = true;
+
+ /* Finding the maximum match across the offsets */
+ if ((i + 1 == noffsets)
+ || ((dp + match_len) < dend)
+ || ((hp + match_len < hend)))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ match_off += match_len;
+ dp += match_len;
+ hp += match_len;
+
+ if (match_len == PGLZ_MAX_MATCH)
+ {
+ match_len = 0;
+ goto MATCH_AGAIN;
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+ }
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+
+ /* copy the unmatched data to output buffer directly from source */
+ while ((dp + match_len) < dend)
+ {
+ if (bp - bstart >= result_max)
+ return false;
+
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+ #ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+ #endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header));
+
+ return true;
+ }
/* ----------
* pglz_decompress -
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 884,921 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 699,724 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
/*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
*/
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
*dp++ = *sp++;
}
--- 959,996 ----
break;
}
! if (history)
! {
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
{
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
/*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
*/
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
*dp++ = *sp++;
}
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
! * update operation is
! * delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 687,697 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata);
+ extern void heap_delta_decode (char *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 107,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
--- 107,119 ----
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+ extern bool pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+ extern void pglz_decompress_with_history(const char *source, char *dest,
+ uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuos and non continuos columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuos and non continuos columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way with which current patch code can be
made
better,
so that we don't need to change the encoding approach, as I am
having
feeling that this might not be performance wise equally good.
The point is that I don't want to heap_delta_encode() to know the
internals of pglz compression. You could probably make my patchmore
like yours in behavior by also passing an array of offsets in the
new tuple to check, and only checking for matches as those offsets.I think it makes sense, because if we have offsets of both new and
old
tuple, we can internally use memcmp to compare columns and use same
algorithm for encoding.
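To illustrate that idea with a minimal sketch (the helper below is hypothetical and only for illustration, not code from either patch): with both offset arrays available, each column can be compared in place, and only the differing ones need to be emitted as new data.

/*
 * Illustrative only. Assumes PostgreSQL's c.h typedefs (int32, bool) and
 * <string.h> for memcmp(). Returns true when the attribute bytes are
 * identical in the old and new tuple versions, so the encoder can emit a
 * history reference instead of copying the data.
 */
static bool
attr_unchanged(const char *newdata, int32 newoff, int32 newlen,
               const char *olddata, int32 oldoff, int32 oldlen)
{
    return newlen == oldlen &&
           memcmp(newdata + newoff, olddata + oldoff, newlen) == 0;
}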
I will change the patch according to this suggestion.
I have modified the patch as per above suggestion.
Apart from passing new and old tuple offsets, I have passed
bitmaplength also, as we need to copy the bitmap of new tuple as it is
into Encoded WAL Tuple.
Please see if such API design is okay?
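For readers skimming the thread, the API shape in question (declarations exactly as in the attached v10 patch; the comments are added here only for orientation) is:

/* heaptuple.c: deform both tuple versions to get per-attribute offsets,
 * then hand plain byte ranges plus those offsets down to pglz. */
extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
                              HeapTuple newtup, char *encdata);

/* pg_lzcompress.c: delta-encode 'source' against 'history', considering
 * matches only at the given attribute offsets; the first newbitmaplen
 * bytes (the new tuple's null bitmap and padding) are copied as-is. */
extern bool pglz_compress_with_history(const char *source, int32 slen,
                                       const char *history, int32 hlen,
                                       int32 *newoffsets, int32 *hoffsets,
                                       int32 noffsets,
                                       int32 newbitmaplen, int32 hbitmaplen,
                                       PGLZ_Header *dest,
                                       const PGLZ_Strategy *strategy);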
I shall update the README and send the performance/WAL Reduction data
for modified patch tomorrow.
Updated patch including comments and README is attached with this mail.
This patch contain exactly same design behavior as per previous.
It takes care of API design suggestion of Heikki.
The performance data is similar, as it is not complete, I shall send that
tomorrow.
With Regards,
Amit Kapila.
Attachments:
wal_update_changes_v10.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,70 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/datum.h"
+ #include "utils/pg_lzcompress.h"
+ /* guc variable for EWT compression ratio*/
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 69,74 ****
--- 73,80 ----
#define VARLENA_ATT_IS_PACKABLE(att) \
((att)->attstorage != 'p')
+ static void heap_get_attr_offsets(TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets);
/* ----------------------------------------------------------------
* misc support routines
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 623,775 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_get_attr_offsets
+ *
+ * Given a heap tuple, extract each attribute's starting offset and return
+ * the result as an array of offsets.
+ * If an attribute is null, its offset will be the end offset of the
+ * previous attribute.
+ * ----------------
+ */
+ static void
+ heap_get_attr_offsets(TupleDesc tupleDesc, HeapTuple Tuple,
+ int32 **offsets, int *noffsets)
+ {
+ HeapTupleHeader tup = Tuple->t_data;
+ Form_pg_attribute *att = tupleDesc->attrs;
+ bool hasnulls = HeapTupleHasNulls(Tuple);
+ bits8 *bp = Tuple->t_data->t_bits; /* ptr to null bitmap in tuple */
+ bool slow = false; /* can we use/set attcacheoff? */
+ char *tp; /* ptr to tuple data */
+ long off; /* offset in tuple data */
+ int natts;
+ int attnum;
+
+ natts = HeapTupleHeaderGetNatts(Tuple->t_data);
+
+ *offsets = palloc(natts * sizeof(int32));
+
+ *noffsets = 0;
+
+ /* copied from heap_deform_tuple */
+ tp = (char *) tup + tup->t_hoff;
+ off = 0;
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ Form_pg_attribute thisatt = att[attnum];
+
+ if (hasnulls && att_isnull(attnum, bp))
+ {
+ slow = true; /* can't use attcacheoff anymore */
+ (*offsets)[(*noffsets)++] = off;
+ continue;
+ }
+
+ if (!slow && thisatt->attcacheoff >= 0)
+ off = thisatt->attcacheoff;
+ else if (thisatt->attlen == -1)
+ {
+ /*
+ * We can only cache the offset for a varlena attribute if the
+ * offset is already suitably aligned, so that there would be no
+ * pad bytes in any case: then the offset will be valid for either
+ * an aligned or unaligned value.
+ */
+ if (!slow &&
+ off == att_align_nominal(off, thisatt->attalign))
+ thisatt->attcacheoff = off;
+ else
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ slow = true;
+ }
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+
+ if (!slow)
+ thisatt->attcacheoff = off;
+ }
+
+ (*offsets)[(*noffsets)++] = off;
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+
+ if (thisatt->attlen <= 0)
+ slow = true; /* can't use attcacheoff anymore */
+
+ }
+
+ }
+
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_Header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata)
+ {
+ int32 *hoffsets,
+ *newoffsets;
+ int noffsets;
+ PGLZ_Strategy strategy;
+ int32 newbitmaplen,
+ hbitmpalen;
+
+ /*
+ * If the lengths of the old and new tuple versions differ by more than
+ * 50%, include the new tuple as-is.
+ */
+ if ((newtup->t_len <= (oldtup->t_len >> 1))
+ || (oldtup->t_len <= (newtup->t_len >> 1)))
+ return false;
+
+ newbitmaplen = newtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+ hbitmpalen = oldtup->t_data->t_hoff - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * Deform and get the attribute offsets for old and new tuple which will
+ * be used for calculating delta between old and new tuples.
+ */
+ heap_get_attr_offsets(tupleDesc, oldtup, &hoffsets, &noffsets);
+ heap_get_attr_offsets(tupleDesc, newtup, &newoffsets, &noffsets);
+
+ strategy = *PGLZ_strategy_always;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_compress_with_history((char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ newoffsets, hoffsets, noffsets,
+ newbitmaplen, hbitmpalen,
+ (PGLZ_Header *) encdata, &strategy);
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+ void
+ heap_delta_decode(char *encdata, HeapTuple oldtup, HeapTuple newtup)
+ {
+ return pglz_decompress_with_history((char *) encdata,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits));
+ }
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 70,75 ****
--- 70,76 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/pg_lzcompress.h"
/* GUC variable */
***************
*** 5765,5770 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5766,5781 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ struct
+ {
+ PGLZ_Header pglzheader;
+ char buf[MaxHeapTupleSize];
+ } buf;
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5774,5788 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5785,5830 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by an UPDATE
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that for long runs most
+ * updates tend to create the new tuple version on the same page, there
+ * should not be a significant impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if ((oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, (char *) &buf.pglzheader))
+ {
+ compressed = true;
+ newtupdata = (char *) &buf.pglzheader;
+ newtuplen = VARSIZE(&buf.pglzheader);
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5809,5817 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5851,5862 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6614,6620 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6659,6668 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6629,6635 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6677,6683 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6689,6695 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6737,6743 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6707,6713 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6755,6761 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6732,6738 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6780,6786 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6795,6804 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6843,6874 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
! * + New data (1 byte length + variable data)+ ...
! */
! PGLZ_Header *encoded_data = (PGLZ_Header *) (((char *) xlrec) + hsize);
!
! oldtup.t_data = oldtupdata;
! newtup.t_data = htup;
!
! heap_delta_decode((char *) encoded_data, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6814,6820 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6884,6890 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 665,670 **** then restart recovery. This is part of the reason for not writing a WAL
--- 665,784 ----
entry until we've successfully done the original action.
+ Encoded WAL Tuple (EWT)
+ -----------------------
+
+ Delta Encoded WAL Tuple (EWT) eliminates the need to copy the entire tuple
+ to WAL for an update operation. An EWT is constructed using pglz by comparing
+ the old and new versions of the tuple with respect to column boundaries. It
+ contains the data from the new tuple for modified columns and [Offset,Length]
+ references into the old tuple version for unchanged columns.
+
+
+ EWT Format
+ ----------
+
+ Header + Control byte + History Reference (2 - 3)bytes
+ + New data (1 byte length + variable data) + ...
+
+
+ Header:
+
+ The header is same as PGLZ_Header, which is used to store the compressed length
+ and raw length.
+
+ Control byte:
+
+ The first byte after the header tells what to do the next 8 times. We call this
+ the control byte.
+
+
+ History Reference:
+
+ A set bit in the control byte means, that a tag of 2-3 bytes follows.
+ A tag contains information to copy some bytes from old tuple version to
+ the current location in the output.
+
+ Details about 2-3 byte Tag
+ A 2 byte tag is used when the length of the history data
+ (unchanged data from the old tuple version) is less than 18.
+ A 3 byte tag is used when the length of the history data
+ (unchanged data from the old tuple version) is greater than or equal to 18.
+ The maximum length that can be represented by one Tag is 273.
+
+ Let's call the three tag bytes T1, T2 and T3. The position of the data
+ to copy is coded as an offset from the old tuple.
+
+ The offset is in the upper nibble of T1 and in T2.
+ The length is in the lower nibble of T1.
+
+ So the 16 bits of a 2 byte tag are coded as
+
+ 7---T1--0 7---T2--0
+ OOOO LLLL OOOO OOOO
+
+ This limits the offset to 1-4095 (12 bits) and the length to 3-18 (4 bits)
+ because 3 is always added to it.
+
+ In the actual implementation, the 2 byte tag's length is limited to 3-17,
+ because the value 0xF in the length nibble has special meaning. It means,
+ that the next following byte (T3) has to be added to the length value of 18.
+ That makes total limits of 1-4095 for offset and 3-273 for length.
+
+
+ New data:
+
+ An unset bit in the control byte represents modified data of the new tuple
+ version. The first byte gives the length [0-255] of the modified data,
+ followed by the modified data of that length.
+
+ 7---T1--0 7---T2--0 ...
+ LLLL LLLL DDDD DDDD ...
+
+ Data bytes repeat until the length of the new data.
+
+
+ L - Length
+ O - Offset
+ D - Data
+
+
+ Encoding Mechanism for EWT
+ --------------------------
+ Copy the bitmap data from new tuple to the EWT (Encoded WAL Tuple)
+ and loop for all attributes to find any modifications in the attributes.
+ The unmodified data is encoded as a History Reference in EWT and the
+ modified data (if NOT NULL) is encoded as New Data in EWT.
+
+ The offset values are calculated with respect to the tuple t_hoff value.
+ The maximum encoded data length is 75% (the default compression rate) of the
+ original data; if the encoded output is longer than that, the original tuple
+ (new tuple version) will be stored directly in the WAL tuple.
+
+
+ Decoding Mechanism for EWT
+ --------------------------
+ Skip the header, read one control byte, and process the next 8 items
+ (or as many as remain in the compressed input). Check each control bit;
+ if the bit is set then it is a History Reference, which means the next
+ 2-3 byte tag provides the offset and length of the history match.
+
+ Use the offset and corresponding length to copy data from old tuple
+ version to new tuple. If the control bit is unset, then it is
+ New Data Reference which means first byte contains the length [0-255]
+ of the modified data, followed by the modified data of corresponding length
+ specified in the first byte.
+
+
+ Constraints for EWT
+ --------------------
+ 1. Delta encoding is allowed only when the update places the new tuple
+ version on the same page and the buffer does not need a backup block
+ (relevant when full_page_writes is on).
+ 2. Only old tuples with length less than PGLZ_HISTORY_SIZE are allowed for
+ encoding.
+ 3. The old and new tuple versions must not vary in length by more than 50%
+ to be allowed for encoding.
+
+
Asynchronous Commit
-------------------
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1209,1214 **** begin:;
--- 1209,1236 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/adt/pg_lzcompress.c
--- b/src/backend/utils/adt/pg_lzcompress.c
***************
*** 362,367 **** do { \
--- 362,391 ----
} \
} while (0)
+ /* ----------
+ * pglz_out_add -
+ *
+ * Outputs a reference tag of 1 byte with length and the new data
+ * to the destination buffer, including the appropriate control bit.
+ * ----------
+ */
+ #define pglz_out_add(_ctrlp,_ctrlb,_ctrl,_buf,_len,_byte) \
+ do { \
+ int32 _maddlen; \
+ int32 _addtotal_len = (_len); \
+ while (_addtotal_len > 0) \
+ { \
+ _maddlen = _addtotal_len > 255 ? 255 : _addtotal_len; \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrl <<= 1; \
+ (_buf)[0] = (unsigned char)(_maddlen); \
+ (_buf) += 1; \
+ memcpy((_buf), (_byte), _maddlen); \
+ (_buf) += _maddlen; \
+ (_byte) += _maddlen; \
+ _addtotal_len -= _maddlen; \
+ } \
+ } while (0)
/* ----------
* pglz_find_match -
***************
*** 471,476 **** pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
--- 495,539 ----
return 0;
}
+ /* ----------
+ * pglz_find_match_with_history -
+ *
+ * Check whether the actual input stream matches the given history
+ * string, and if so, determine the length of the match.
+ * ----------
+ */
+ static inline int
+ pglz_find_match_with_history(const char *input, const char *end,
+ const char *history, const char *hend, int *lenp)
+ {
+ const char *ip = input;
+ const char *hp = history;
+
+ /*
+ * Determine the length of the match by comparing byte by byte until the
+ * first difference, the end of either buffer, or PGLZ_MAX_MATCH bytes
+ * have been matched.
+ */
+ while (ip < end && hp < hend && *ip == *hp && *lenp < PGLZ_MAX_MATCH)
+ {
+ (*lenp)++;
+ ip++;
+ hp++;
+ }
+
+ /*
+ * Return match information only if it results at least in one byte
+ * reduction.
+ */
+ if (*lenp > 2)
+ return 1;
+
+ return 0;
+ }
+
/* ----------
* pglz_compress -
***************
*** 637,642 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
--- 700,895 ----
return true;
}
+ /* ----------
+ * pglz_compress_with_history
+ *
+ * Like pglz_compress, but performs delta encoding rather than compression.
+ * The references are offsets from the start of history data, rather
+ * than the current output position. 'hoffsets' and 'newoffsets' are arrays of
+ * offsets in the history and source to consider. We scan the history
+ * string for possible matches with the source string at the attribute offsets.
+ *
+ * For attributes having a NULL value, the offset will be the same as the next
+ * attribute's offset. When the old tuple contains NULL and the new tuple has a
+ * non-NULL value, the value is copied as New Data into the Encoded WAL Tuple.
+ * When the new tuple has a NULL value and the old tuple has a non-NULL value,
+ * the old tuple value is ignored.
+ * ----------
+ */
+ bool
+ pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy)
+ {
+ unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int i,
+ len;
+ int32 need_rate;
+ const char *hp = history;
+ const char *hend = history + hlen;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encoding, as this is the maximum size of a history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ /*
+ * Save the original source size in the header.
+ */
+ dest->rawsize = slen;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ /*
+ * Copy the source directly into the output buffer up to newbitmaplen.
+ */
+ if ((bp + newbitmaplen + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, newbitmaplen, dp);
+
+ /*
+ * Loop through all attributes offsets, if the attribute data differs with
+ * history referring offsets, store the [Offset,Length] reffering history
+ * version till the match and store the changed data as New data. We need
+ * to accumulate all the matched attributes till an unmatched one is
+ * found. For the last attribute if it is matched, directly store its
+ * Offset. It can be improved for accumulation of unmatched attributes.
+ */
+ match_off = hbitmaplen;
+ hp = history + hbitmaplen;
+ for (i = 0; i < noffsets; i++)
+ {
+ dend = source + ((i + 1 == noffsets) ? slen : newoffsets[i + 1] + newbitmaplen);
+ hend = history + ((i + 1 == noffsets) ? hlen : hoffsets[i + 1] + hbitmaplen);
+
+ MATCH_AGAIN:
+
+ /* If we already exceeded the maximum result size, fail. */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history. It can match maximum
+ * PGLZ_MAX_MATCH in one pass as history tag can be of 3 bytes. For
+ * match greater than PGLZ_MAX_MATCH, it need to do it in multiple
+ * passes (MATCH_AGAIN).
+ */
+ if (pglz_find_match_with_history(dp + match_len, dend, hp + match_len,
+ hend, &match_len))
+ {
+ found_match = true;
+
+ /* Finding the maximum match across the offsets */
+ if ((i + 1 == noffsets)
+ || ((dp + match_len) < dend)
+ || ((hp + match_len < hend)))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ match_off += match_len;
+ dp += match_len;
+ hp += match_len;
+
+ if (match_len == PGLZ_MAX_MATCH)
+ {
+ match_len = 0;
+ goto MATCH_AGAIN;
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+ }
+ }
+ else
+ {
+ hp = hend;
+ match_off = hend - history;
+ match_len = 0;
+ }
+
+ /* copy the unmatched data to output buffer directly from source */
+ len = dend - (dp + match_len);
+ if ((bp + len + 2) - bstart >= result_max)
+ return false;
+
+ pglz_out_add(ctrlp, ctrlb, ctrl, bp, len, dp);
+ }
+
+ if (!found_match)
+ return false;
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+ #ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+ #endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ SET_VARSIZE_COMPRESSED(dest, result_size + sizeof(PGLZ_Header));
+
+ return true;
+ }
/* ----------
* pglz_decompress -
***************
*** 647,661 **** pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(source);
dp = (unsigned char *) dest;
! destend = dp + source->rawsize;
while (sp < srcend && dp < destend)
{
--- 900,937 ----
void
pglz_decompress(const PGLZ_Header *source, char *dest)
{
+ pglz_decompress_with_history((char *) source, dest, NULL, NULL);
+ }
+
+ /* ----------
+ * pglz_decompress_with_history -
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+ void
+ pglz_decompress_with_history(const char *source, char *dest, uint32 *destlen,
+ const char *history)
+ {
+ PGLZ_Header src;
const unsigned char *sp;
const unsigned char *srcend;
unsigned char *dp;
unsigned char *destend;
+ /* To avoid the unaligned access of PGLZ_Header */
+ memcpy((char *) &src, source, sizeof(PGLZ_Header));
+
sp = ((const unsigned char *) source) + sizeof(PGLZ_Header);
! srcend = ((const unsigned char *) source) + VARSIZE(&src);
dp = (unsigned char *) dest;
! destend = dp + src.rawsize;
!
! if (destlen)
! {
! *destlen = src.rawsize;
! }
while (sp < srcend && dp < destend)
{
***************
*** 665,670 **** pglz_decompress(const PGLZ_Header *source, char *dest)
--- 941,947 ----
*/
unsigned char ctrl = *sp++;
int ctrlc;
+ int32 len;
for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
{
***************
*** 677,683 **** pglz_decompress(const PGLZ_Header *source, char *dest)
* coded as 18, another extension tag byte tells how much
* longer the match really was (0-255).
*/
- int32 len;
int32 off;
len = (sp[0] & 0x0f) + 3;
--- 954,959 ----
***************
*** 699,726 **** pglz_decompress(const PGLZ_Header *source, char *dest)
break;
}
! /*
! * Now we copy the bytes specified by the tag from OUTPUT to
! * OUTPUT. It is dangerous and platform dependent to use
! * memcpy() here, because the copied areas could overlap
! * extremely!
! */
! while (len--)
{
! *dp = dp[-off];
! dp++;
}
}
else
{
! /*
! * An unset control bit means LITERAL BYTE. So we just copy
! * one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
! *dp++ = *sp++;
}
/*
--- 975,1030 ----
break;
}
! if (history)
{
! /*
! * Now we copy the bytes specified by the tag from history
! * to OUTPUT.
! */
! memcpy(dp, history + off, len);
! dp += len;
! }
! else
! {
! /*
! * Now we copy the bytes specified by the tag from OUTPUT
! * to OUTPUT. It is dangerous and platform dependent to
! * use memcpy() here, because the copied areas could
! * overlap extremely!
! */
! while (len--)
! {
! *dp = dp[-off];
! dp++;
! }
}
}
else
{
! if (history)
! {
! len = sp[0];
! sp++;
! /*
! * Now we copy the bytes specified by the len from source
! * to OUTPUT.
! */
! memcpy(dp, sp, len);
! sp += len;
! dp += len;
! }
! else
! {
! /*
! * An unset control bit means LITERAL BYTE. So we just
! * copy one from INPUT to OUTPUT.
! */
! if (dp >= destend) /* check for buffer overrun */
! break; /* do not clobber memory */
!
! *dp++ = *sp++;
! }
}
/*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 123,128 **** extern int CommitSiblings;
--- 123,129 ----
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2382,2387 **** static struct config_int ConfigureNamesInt[] =
--- 2383,2399 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 1, 99,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! int flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
! * page's all-visible
! * bit is cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
! * update operation is
! * delta encoded */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(int))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 687,697 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata);
+ extern void heap_delta_decode(char *encdata, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/include/utils/pg_lzcompress.h
--- b/src/include/utils/pg_lzcompress.h
***************
*** 107,112 **** extern const PGLZ_Strategy *const PGLZ_strategy_always;
--- 107,119 ----
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+ extern bool pglz_compress_with_history(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ int32 *newoffsets, int32 *hoffsets, int32 noffsets,
+ int32 newbitmaplen, int32 hbitmaplen,
+ PGLZ_Header *dest, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+ extern void pglz_decompress_with_history(const char *source, char *dest,
+ uint32 *destlen, const char *history);
#endif /* _PG_LZCOMPRESS_H_ */
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuos and non continuos columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuos and non continuos columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuos columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non continuos columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way with which current patch code can be
made
better,
so that we don't need to change the encoding approach, as I am
having
feeling that this might not be performance wise equally good.
The point is that I don't want to heap_delta_encode() to know the
internals of pglz compression. You could probably make my patchmore
like yours in behavior by also passing an array of offsets in the
new tuple to check, and only checking for matches as thoseoffsets.
I think it makes sense, because if we have offsets of both new and
old
tuple, we can internally use memcmp to compare columns and use same
algorithm for encoding.
I will change the patch according to this suggestion.
I have modified the patch as per above suggestion.
Apart from passing new and old tuple offsets, I have passed
bitmaplength also, as we need to copy the bitmap of new tuple as it is
into Encoded WAL Tuple.
Please see if such API design is okay?
I shall update the README and send the performance/WAL Reduction data
for modified patch tomorrow.Updated patch including comments and README is attached with this mail.
This patch contain exactly same design behavior as per previous.
It takes care of API design suggestion of Heikki.The performance data is similar, as it is not complete, I shall send
that tomorrow.
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous patch):
1. With original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250 record pgbench there is a max wal reduction of 35% with not much
performance difference.
3. With 500 and above record size in pgbench there is an improvement in the
performance and wal reduction both.
If the record size increases there is a gain in performance and wal size is
reduced as well.
Performance data for synchronous_commit = on is under progress, I shall post
it once it is done.
I am expecting it to be same as previous.
With Regards,
Amit Kapila.
Attachments:
On Friday, February 01, 2013 6:37 PM Amit Kapila wrote:
On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:
On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:
On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:
On 29.01.2013 11:58, Amit Kapila wrote:
Can there be another way with which current patch code can be
made
better,
so that we don't need to change the encoding approach, as I
am
having
feeling that this might not be performance wise equally good.
The point is that I don't want heap_delta_encode() to know
the internals of pglz compression. You could probably make my
patch more
like yours in behavior by also passing an array of offsets in
the new tuple to check, and only checking for matches at those offsets.
I think it makes sense, because if we have the offsets of both the new
and old tuple, we can internally use memcmp to compare columns and use
the same algorithm for encoding.
I will change the patch according to this suggestion. I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have passed the
bitmap length also, as we need to copy the bitmap of the new tuple as it is
into the Encoded WAL Tuple.
Please see if such an API design is okay?
I shall update the README and send the performance/WAL reduction
data for the modified patch tomorrow. The updated patch, including comments and README, is attached with this
mail.
This patch contains exactly the same design and behavior as the previous one.
It takes care of Heikki's API design suggestion. The performance data is similar; as it is not complete, I shall send
it tomorrow. Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction. As the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I shall
post it once it is done.
I am expecting it to be the same as before.
Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%).
If the record size increases there is a very good reduction in WAL size.
Please provide your feedback.
With Regards,
Amit Kapila.
Attachments:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction. As the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I shall
post it once it is done.
I am expecting it to be the same as before. Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%). If the record size increases there is a very good reduction in WAL size.
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction.
If the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I
shall post it once it is done.
I am expecting it to be the same as before. Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%).
If the record size increases there is a very good reduction in WAL
size.
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.
For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do
this optimization.
This optimization is mainly for medium-size records.
With Regards,
Amit Kapila.
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
1. With the original pgbench there is a max 7% WAL reduction with not much
performance difference.
2. With 250-byte records in pgbench there is a max WAL reduction of 35% with not
much performance difference.
3. With 500-byte and larger records in pgbench there is an improvement in
both performance and WAL reduction.
If the record size increases there is a gain in performance and the WAL
size is reduced as well. Performance data for synchronous_commit = on is in progress; I
shall post it once it is done.
I am expecting it to be the same as before. Please find the performance readings for synchronous_commit = on.
Each run is taken for 20 min.
Conclusions from the readings with synchronous_commit = on:
1. With the original pgbench there is a max 2% WAL reduction with not much
performance difference.
2. With 500-byte records in pgbench there is a max WAL reduction of 3% with not much
performance difference.
3. With 1800-byte records in pgbench there is both an improvement in
performance (approx 3%) and a WAL reduction (44%).
If the record size increases there is a very good reduction in WAL
size.
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues. For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do
this optimization.
This optimization is mainly for medium-size records.
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:
unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788
Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008
pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821
In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE
is timed. Duration is the time spent in the UPDATE (lower is better),
and wal_generated is the amount of WAL generated by the updates (lower
is better).
The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario.
Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above.
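To make the approach concrete, here is a minimal round-trip sketch of the
pglz_delta_encode/pglz_delta_decode functions added by the attached patch. It
is illustrative only and not a standalone compilable program; newdata/newlen
and olddata/oldlen are hypothetical stand-ins for the new and old tuple
bodies, and the real callers are log_heap_update() and heap_xlog_update() in
the patch.

    char        encoded[BLCKSZ];    /* large enough for any heap tuple's EWT */
    char        decoded[BLCKSZ];
    uint32      enclen;
    uint32      declen;

    if (pglz_delta_encode(newdata, newlen,      /* data to encode      */
                          olddata, oldlen,      /* history = old tuple */
                          encoded, &enclen,
                          NULL))                /* default strategy    */
    {
        /* enclen bytes of EWT go into the WAL record instead of newlen bytes */
        pglz_delta_decode(encoded, enclen,
                          decoded, sizeof(decoded), &declen,
                          olddata, oldlen);
        Assert(declen == newlen);
        Assert(memcmp(decoded, newdata, newlen) == 0);
    }
    else
    {
        /* not compressible enough: the full new tuple is logged as before */
    }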
The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default, this
probably just isn't worth it.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function, it
goes further than that, and contains some further micro-optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for speed.
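As a rough, untested sketch of that last idea (not part of the attached
patch), the history-building loop in pglz_delta_encode() could sample
positions instead of inserting every one. The macros and variables below are
those from the patch; HIST_SAMPLE is an assumed tuning constant.

    #define HIST_SAMPLE 10          /* assumed sampling interval */

    int     sample = 0;

    pglz_hash_init(hp, hindex, a, b, c, d);
    while (hp < hend - 4)
    {
        pglz_hash_roll(hp, hindex, a, b, c, d, mask);

        /* Insert only every HIST_SAMPLE'th position into the lookup table. */
        if (sample++ % HIST_SAMPLE == 0)
            pglz_hist_add_no_recycle(hist_start, hist_entries,
                                     hist_next,
                                     hp, hend, hindex);
        hp++;                       /* as in the patch, keep the ++ out of the macro */
    }

This trades fewer candidate matches (so potentially worse compression) for a
cheaper setup phase over the old tuple.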
If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to be
brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.
PS. I haven't done much testing of WAL redo, so it's quite possible that
the encoding is actually buggy, or that decoding is slow. But I don't
think there's anything so fundamentally wrong that it would affect the
performance results much.
- Heikki
Attachments:
pglz-with-micro-optimizations-1.patch (text/x-diff)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..d6458b2 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len);
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d226726..5a9bea9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5803,6 +5805,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5812,15 +5820,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to backup the whole bolck in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5847,9 +5887,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6652,7 +6695,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6667,7 +6713,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6727,7 +6773,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6745,7 +6791,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6770,7 +6816,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6833,10 +6879,30 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6852,7 +6918,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d960bbc..c721392 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..96c5c61b 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be use to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * An version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +421,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
+ if (!hend)
+ {
thisoff = ip - hp;
if (thisoff >= 0x0fff)
break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,200 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1028,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuos and non continuos columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuos columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuos columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:
unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788
Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008
pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821
For some of the tests, it doesn't even execute the main part of the
compression/encoding.
The reason is that the length of the tuple is less than the strategy's minimum length, so
it returns from the below check
in function pglz_delta_encode():
if (strategy->match_size_good <= 0 ||
slen < strategy->min_input_size ||
slen > strategy->max_input_size)
return false;
The tests for which it doesn't execute encoding are below:
two short fields, no change
two short fields, one changed
two short fields, both changed
ten tiny fields, all changed
For the above cases, the difference in timings between both approaches and
the original could be because
this check is done after some processing. So I think if we check the length
in log_heap_update, then
there should not be a timing difference for the above test scenarios. I can check
that once.
This optimization helps only when the tuple length is greater than about 128~200 bytes and
up to 1800 bytes (until the tuple is toasted); otherwise it could result in
overhead without any major WAL reduction.
In fact, I think one of my initial patches had a check to perform the
optimization only if the tuple length is greater than 128 bytes.
I shall try to run both patches for cases where the tuple length is > 128~200
bytes, as this optimization has benefits in those cases.
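As an illustration only (not from any posted patch), the early length check
being discussed could look roughly like this in log_heap_update(). The symbols
other than PGLZ_MIN_DELTA_LEN are those used in the attached patch;
PGLZ_MIN_DELTA_LEN is a made-up name for the ~128-byte heuristic.

    #define PGLZ_MIN_DELTA_LEN 128      /* assumed lower bound for delta encoding */

    if (wal_update_compression_ratio != 0 &&
        oldbuf == newbuf &&
        newtup->t_len >= PGLZ_MIN_DELTA_LEN &&
        !XLogCheckBufferNeedsBackup(newbuf))
    {
        uint32      enclen;

        /* Only now pay the cost of trying to delta-encode the new tuple. */
        if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
        {
            compressed = true;
            newtupdata = buf;
            newtuplen = enclen;
        }
    }

Tuples much larger than ~2000 bytes are already shortened by TOAST, per the
discussion above, so only the lower bound needs an explicit check here.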
In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the
UPDATE is timed. Duration is the time spent in the UPDATE (lower is
better), and wal_generated is the amount of WAL generated by the
updates (lower is better). The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario. Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries
are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above. The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.
As I mentioned, for smaller tuples it can be overhead without any major
benefit in WAL reduction,
so I think before doing the encoding it should ensure that the tuple length is
greater than some threshold length.
Yes, it can miss some cases, as your test has shown for "hundred tiny
fields, half nulled",
but we might be able to safely enable it by default.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more.
One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.
Do you mean to say roll 10 times and then call pglz_hist_add_no_recycle,
and do the same
before pglz_find_match?
I shall try doing this for the tests.
If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to
be brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.
Agreed; though it gives a benefit in some of the cases, it should not
degrade much
in any of the other cases.
One more thing: any compression technique will have some overhead, so it
should be
used selectively rather than in every case. So in that regard, I think we
should do this
optimization only when it has a better chance of a win (for example, based on the length of the
tuple, or some other criteria
where the WAL tuple can otherwise be logged as-is). What is your opinion?
PS. I haven't done much testing of WAL redo, so it's quite possible
that the encoding is actually buggy, or that decoding is slow. But I
don't think there's anything so fundamentally wrong that it would
affect the performance results much.
I also don't think it will have any problems, but I can run some tests to
verify.
With Regards,
Amit Kapila.
On 2013-03-05 23:26:59 +0200, Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.

For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do
this optimization.
This optimization is mainly for medium-sized records.

I've been investigating the pglz option further, and doing performance
comparisons of the pglz approach and this patch. I'll begin with some
numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows, and
fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE is
timed. Duration is the time spent in the UPDATE (lower is better), and
wal_generated is the amount of WAL generated by the updates (lower is
better).The summary is that Amit's patch is a small win in terms of CPU usage, in
the best case where the table has few columns, with one large column that is
not updated. In all other cases it just adds overhead. In terms of WAL size,
you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz algorithm
to spot the similarities between the old and new tuple, instead of having
explicit knowledge of where the column boundaries are. This has the
advantage that it will spot similarities, and be able to compress, in more
cases. For example, you can see a reduction in WAL size in the "hundred tiny
fields, half nulled" test case above.The attached patch also just adds overhead in most cases, but the overhead
is much smaller in the worst case. I think that's the right tradeoff here -
we want to avoid scenarios where performance falls off the cliff. That said,
if you usually just get a slowdown, we certainly can't make this the
default, and if we can't turn it on by default, this probably just isn't
worth it.

The attached patch contains the variable-hash-size changes I posted in the
"Optimizing pglz compressor". But in the delta encoding function, it goes
further than that, and contains some further micro-optimizations: the hash
is calculated in a rolling fashion, and it uses a specialized version of the
pglz_hist_add macro that knows that the input can't exceed 4096 bytes. Those
changes shaved off some cycles, but you could probably do more. One idea is
to only add every 10 bytes or so to the history lookup table; that would
sacrifice some compressibility for speed.

If you could squeeze the pglz_delta_encode function to be cheap enough that we
could enable this by default, this would be a pretty cool patch. Or at least,
the overhead in the cases that you get no compression needs to be brought
down, to about 2-5 % at most I think. If it can't be done easily, I feel
that this probably needs to be dropped.
While this is exciting stuff - and I find Heikki's approach more
interesting and applicable to more cases - I think this is clearly not
9.3 material anymore. There are loads of tradeoffs here which require a
substantial amount of benchmarking, and it's not the kind of change that
can be backed out easily during 9.3's lifecycle.
And I have to say I find 2-5% performance overhead too high...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the
UPDATE is timed. Duration is the time spent in the UPDATE (lower is
better), and wal_generated is the amount of WAL generated by the
updates (lower is better).
Based on your patch, I have tried some more optimizations:

Fixed a bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to the wrong length of the old
tuple being passed to decode, which I have corrected.

Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
1. Moved the strategy minimum-length (32) check into log_heap_update.
2. Added the rolling-by-10 hash insertion you suggested.

Approach-2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. This is done on top of the Approach-1 changes.
2. Used 1 byte of data as the hash key (see the sketch after this list).

Approach-3
(pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. This is done on top of the Approach-1 and Approach-2 changes.
2. Instead of copying each literal byte individually when it doesn't match the
history, copy them all in a batch.

Data for all the above approaches is in the attached file "test_readings".
(Apart from your tests, I have added one more test, "hundred tiny fields,
first 10 changed".)
Summary -
After the Approach-1 changes, CPU utilization for all tests except 2
("hundred tiny fields, all changed" and "hundred tiny fields, half changed")
is the same or lower. The best-case CPU utilization has decreased (which is
better), but the WAL reduction is slightly less (which is as per expectation,
since only every 10th position is added to the history).
The Approach-2 modifications were done to see whether the hash calculation
itself has any overhead.
Approach-2 and Approach-3 don't result in any improvement.
I have investigated the reason for the higher CPU utilization in those 2 tests:
there is nothing to compress in the new tuple, and the encoder only finds that
out after it has processed 75% (the compression-ratio threshold) of the tuple
bytes.
I think any compression algorithm has this drawback: if the data is not
compressible, it can consume time in spite of the fact that it will not be able
to compress the data.
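To show where the 75% comes from (the arithmetic follows the bail-out check in
the attached patch, assuming the default wal_update_compression_ratio of 25):

    /* Sketch: with a required 25% saving, the output budget is 75% of the input */
    int need_rate  = 25;                                 /* wal_update_compression_ratio */
    int slen       = 1000;                               /* example: 1000 bytes of tuple data */
    int result_max = (slen * (100 - need_rate)) / 100;   /* = 750 bytes of output allowed */
    /*
     * Incompressible input emits roughly one literal byte of output per input
     * byte, so the encoder only exceeds result_max, and gives up, after about
     * 750 input bytes, i.e. 75% of the tuple.
     */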
I think most updates modify only part of the tuple, which will always yield
positive results.
Apart from the above tests, I have run your patch against my old tests, and it
yields quite positive results: the WAL reduction is greater than with my patch,
and CPU utilization is about the same (or my patch is slightly better).
The results are in the attached file "pgbench_pg_lz_mod".
All the above data is for synchronous_commit = off. I can collect the data
for synchronous_commit = on, and for recovery performance.
Any further suggestions?
With Regards,
Amit Kapila.
Attachments:
pglz-with-micro-optimizations-2.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..5b69189 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+	 * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..96c5c61b 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +421,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
+ if (!hend)
+ {
thisoff = ip - hp;
if (thisoff >= 0x0fff)
break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,200 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1028,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
pglz-with-micro-optimizations-2_roll10_32.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..4dcf164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+	 * WAL, as in that case there is no saving from a reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen >=32) && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..8aec6bd 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,39 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a, b, c and d are local variables that these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
/* ----------
* pglz_hist_add -
@@ -276,32 +307,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +420,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ if (!hend)
+ {
+ thisoff = ip - hp;
+ if (thisoff >= 0x0fff)
+ break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +475,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +485,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +505,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +533,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +569,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +585,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +642,22 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,225 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 hindex;
+ int32 a,b,c,d;
+ int32 rollidx;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+
+ rollidx = 0;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the new
+ * data, too.
+ */
+ pglz_hash_roll(hp, hindex, a, b, c, d, mask);
+ if (rollidx % 10 == 0)
+ {
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ rollidx = 0;
+ }
+ hp++; /* Do not do this ++ in the line above! */
+ rollidx++;
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ rollidx = 0;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (dp < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp, hindex, a, b, c, d, mask);
+ if (rollidx % 10 == 0)
+ {
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx = 0;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx++;
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1053,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means a match tag. The tag byte contains the match
+ * length minus 3 and the upper 4 bits of the offset; the following byte
+ * contains the lower 8 bits of the offset. If the length is coded as 18,
+ * another extension tag byte tells how much longer the match really
+ * was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for a WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta-encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
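For reviewers, here is a minimal standalone sketch (not part of either patch) of how the match tags written by pglz_out_tag are interpreted by pglz_delta_decode; the helper name decode_tag and the sample tag bytes are made up for illustration:

#include <stdio.h>

/*
 * Illustrative only: decode one pglz-style match tag the way
 * pglz_delta_decode does.  The first tag byte holds the match length minus 3
 * in its low nibble and the upper 4 bits of the offset in its high nibble;
 * the second byte holds the lower 8 bits of the offset.  A decoded length of
 * 18 (low nibble 0x0f) means one extension byte follows with up to 255 bytes
 * of additional match length.
 */
static const unsigned char *
decode_tag(const unsigned char *sp, int *len, int *off)
{
	*len = (sp[0] & 0x0f) + 3;
	*off = ((sp[0] & 0xf0) << 4) | sp[1];
	sp += 2;
	if (*len == 18)
		*len += *sp++;		/* extension byte for long matches */
	return sp;
}

int
main(void)
{
	/* tag bytes 0xF7 0x2A: length nibble 7 -> len 10, offset 0xF2A -> 3882 */
	const unsigned char tag[] = {0xF7, 0x2A};
	int		len;
	int		off;

	decode_tag(tag, &len, &off);
	printf("len=%d off=%d\n", len, off);	/* prints len=10 off=3882 */
	return 0;
}

The only difference from plain pglz_decompress is where the offset points: in the EWT case the decoder copies the matched bytes from hend - off, i.e. the offset counts back from the end of the old tuple's data rather than back from the current output position.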
pglz-with-micro-optimizations-2_roll10_32_1hashkey.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* GUC variable for EWT compression ratio */
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..4dcf164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * An EWT can be generated for any new tuple version created by an UPDATE
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that for long runs most
+ * updates tend to create the new tuple version on the same page, this
+ * should not have a significant impact on WAL reduction or performance.
+ *
+ * We should not generate an EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen >= 32) && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows,
+ * OR PG93FORMAT (if encoded): LZ header + encoded data follows.
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is an EWT, decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2-3 bytes)
+ * + New data (1 byte length + variable data) + ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but that will not cause any problem because this function is used
+ * only to decide whether an EWT is required for the WAL update record.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..2f3067f 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,20 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_roll() calculates the hash index for the current byte using the given mask.
+ */
+#define pglz_hash_roll(_p,hindex,_mask) \
+ do { \
+ hindex = (_p[0]) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +288,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +401,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ if (!hend)
+ {
+ thisoff = ip - hp;
+ if (thisoff >= 0x0fff)
+ break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +456,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +466,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +486,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +514,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +550,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +566,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +623,22 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +661,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +672,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +689,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +712,221 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 hindex;
+ int32 rollidx;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+
+ rollidx = 0;
+ while (hp < hend)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the new
+ * data, too.
+ */
+ pglz_hash_roll(hp, hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ rollidx = 0;
+ }
+ hp++; /* Do not do this ++ in the line above! */
+ rollidx++;
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ rollidx = 0;
+ while (dp < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp, hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx = 0;
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+
+ rollidx++;
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1030,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means a match tag. The tag byte contains the match
+ * length minus 3 and the upper 4 bits of the offset; the following byte
+ * contains the lower 8 bits of the offset. If the length is coded as 18,
+ * another extension tag byte tells how much longer the match really
+ * was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of the delta record for a WAL update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all-visible
+ * bit was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta-encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..4dcf164 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen >=32) && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..93e7cd0 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,20 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_roll() calculates the hash index for the current byte, using the mask.
+ */
+#define pglz_hash_roll(_p,hindex,_mask) \
+ do { \
+ hindex = (_p[0]) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +288,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,30 +401,44 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (input - end > maxlen)
+ maxlen = input - end;
+ if (hend && (hend - hp > maxlen))
+ maxlen = hend - hp;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
- if (thisoff >= 0x0fff)
- break;
+ if (!hend)
+ {
+ thisoff = ip - hp;
+ if (thisoff >= 0x0fff)
+ break;
+ }
+ else
+ thisoff = hend - hp;
/*
* Determine length of match. A better match must be larger than the
@@ -413,7 +456,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +466,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +486,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +514,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +550,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +566,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +623,22 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +661,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +672,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +689,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +712,219 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 hindex;
+ int32 rollidx;
+ int32 literal_len = 0;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ if (hend - hp > PGLZ_HISTORY_SIZE)
+ hp = hend - PGLZ_HISTORY_SIZE;
+
+ rollidx = 0;
+ while (hp < hend)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the new
+ * data, too.
+ */
+ pglz_hash_roll(hp, hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ rollidx = 0;
+ }
+ hp++; /* Do not do this ++ in the line above! */
+ rollidx++;
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ rollidx = 0;
+ while ((dp + literal_len) < dend)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if ((bp + literal_len) - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll((dp + literal_len), hindex, mask);
+ if (rollidx % 10 == 0)
+ {
+ if (pglz_find_match(hist_start, (dp + literal_len), dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ while (literal_len > 0)
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ literal_len--;
+ }
+
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ }
+ else
+ literal_len++;
+
+ rollidx = 0;
+ }
+ else
+ literal_len++;
+
+ rollidx++;
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1028,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll
begin with some numbers:
Based on your patch, I have tried some more optimizations:
Fixed bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to the wrong length of the old tuple
being passed to decode, which I have corrected.
Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
1. Moved the strategy minimum length (32) check into log_heap_update.
2. Added the rolling of the hash over 10 positions, as suggested by you
(see the sketch after this list).
Approach-2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. Done on top of the Approach-1 changes.
2. Used 1 byte of data as the hash key.
Approach-3 (pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. Done on top of the Approach-1 and Approach-2 changes.
2. Instead of copying a literal byte each time it is found not to match the
history, copy all of them in a batch.
Data for all the above approaches is in the attached file "test_readings"
(apart from your tests, I have added one more test, "hundred tiny
fields, first 10 changed").
Summary -
After the Approach-1 changes, CPU utilization for all but 2 tests
("hundred tiny fields, all changed", "hundred tiny fields, half
changed") is either the same or lower. The best-case CPU utilization has
decreased (which is better), but WAL has increased a little bit
(which is as expected, due to the roll-up over 10 consecutive positions).
The Approach-2 modifications were done to see if there is any overhead in
the hash calculation.
Approach-2 and Approach-3 don't result in any improvements.
I have investigated the reason for the CPU utilization in those 2 tests: there
is nothing to compress in the new tuple, and the encoder only finds that out
after it has processed 75% (the compression ratio) of the tuple bytes.
I think any compression algorithm will have this drawback: if the data
is not compressible, it can consume time despite the fact that it
will not be able to compress the data.
I think most updates will change only part of the tuple, which will always
yield positive results.
Apart from the above tests, I have run your patch against my old tests; it
yields quite positive results. The WAL reduction is larger than with my
patch, and the CPU utilization is almost the same, or my patch is slightly
better.
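As a small worked example of that 75% figure, here is a sketch (not patch
code; the tuple length is illustrative) of the output budget the encoder
works with when wal_update_compression_ratio is at its default of 25:

#include <stdio.h>

int
main(void)
{
    int     slen = 2000;        /* new tuple data length, illustrative */
    int     need_rate = 25;     /* minimum saving required, in percent */

    /* same formula the encoder uses for its output budget */
    int     result_max = (slen * (100 - need_rate)) / 100;     /* 1500 */

    /*
     * If nothing in the new tuple matches the history, every input byte
     * becomes a literal output byte, so the budget is exceeded only after
     * roughly result_max (75% of slen) bytes have already been scanned.
     */
    printf("output budget: %d of %d bytes (need %d%% saving)\n",
           result_max, slen, need_rate);
    return 0;
}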
The results are in the attached file "pgbench_pg_lz_mod".
All of the above data is for synchronous_commit = off. I can collect the
data for synchronous_commit = on, and the recovery performance.
Data for synchronous_commit = on is as follows:
Find the data for Heikki's tests in the file "test_readings_on.txt".
The results and observations are the same as for synchronous_commit = off.
In short, Approach-1, as mentioned in the above mail, seems to be best.
Find the data for the pgbench-based tests used in my previous tests in
"pgbench_pg_lz_mod_sync_commit_on.htm".
This has been done for Heikki's original patch and Approach-1.
It shows that there is a very minor CPU dip (0.1%) in some cases and a WAL
reduction of 2~3%.
The WAL reduction is not large, because fewer operations are performed.
Recovery Performance
----------------------
pgbench org:
./pgbench -i -s 75 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres
pgbench 1800(rec size=1800):
./pgbench -i -s 10 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres
Recovery benchmark:
                    postgres org      postgres pg lz optimization
                    Recovery(sec)     Recovery(sec)
  pgbench org            11                 11
  pgbench 1800           16                 11
This shows that with your patch, recovery performance is also improved.
There is one more defect in recovery, which is fixed in the attached patch
pglz-with-micro-optimizations-3.patch.
In pglz_find_match(), the comparison was going beyond maxlen, because of
which the encoded data was not written to WAL properly.
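For clarity, a simplified sketch of the bounded comparison that the fix
enforces (this is not the patch code itself; match_length and tag_max are
illustrative names): the match must stop at whichever limit comes first -
the longest length the tag can encode, the end of the input, or the end of
the history - otherwise the reported length can run past valid data and the
emitted tag no longer matches what decode will copy back from the old tuple.

static int
match_length(const char *ip, const char *iend,
             const char *hp, const char *hend, int tag_max)
{
    int     maxlen = tag_max;
    int     len = 0;

    if (iend - ip < maxlen)
        maxlen = (int) (iend - ip);
    if (hend - hp < maxlen)
        maxlen = (int) (hend - hp);

    while (len < maxlen && ip[len] == hp[len])
        len++;

    return len;
}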
Finally, as per my work on top of your patch, the best patch will be obtained by
fixing the recovery defects and applying the Approach-1 changes.
With Regards,
Amit Kapila.
Attachments:
pglz-with-micro-optimizations-3.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5250ec7..5b69189 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5806,6 +5808,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5815,15 +5823,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5850,9 +5890,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6655,7 +6698,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6670,7 +6716,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6730,7 +6776,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6748,7 +6794,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6773,7 +6819,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6836,10 +6882,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6855,7 +6922,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a02eebc..5075d2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1236,6 +1236,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..2aa9aaf 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,28 +421,42 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (end - input < maxlen)
+ maxlen = end - input;
+ if (hend && (hend - hp < maxlen))
+ maxlen = hend - hp;
+
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (!hend)
+ thisoff = ip - hp;
+ else
+ thisoff = hend - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,198 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1026,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means COPY: it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98149fc..0e50f6d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -123,6 +123,7 @@ extern int CommitSiblings;
extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2383,6 +2384,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 270924a..5a40457 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that the old
+ * page's all-visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that the new
+ * page's all-visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 8a65492..df178a1 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
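For reference, here is a minimal sketch of how the new pglz delta API declared in the patch's pg_lzcompress.h changes could be called; the buffer sizes, the helper name and the round-trip check are illustrative assumptions, not code from the patch:

#include "postgres.h"
#include "utils/pg_lzcompress.h"

/*
 * Illustrative sketch only: delta-encode a new tuple body against the old
 * one, then decode it back using the old tuple as history (roughly what
 * redo would do with the old tuple fetched from the page).  Assumes
 * old_len is at least 4 and below the 4096-byte pglz history window, and
 * that both lengths fit in the local buffers.
 */
static bool
wal_delta_roundtrip(const char *old_data, int32 old_len,
					const char *new_data, int32 new_len)
{
	char		encoded[BLCKSZ];
	char		decoded[BLCKSZ];
	uint32		enc_len;
	uint32		dec_len;

	/* Encode 'new_data' as back-references into 'old_data' plus literals. */
	if (!pglz_delta_encode(new_data, new_len, old_data, old_len,
						   encoded, &enc_len, PGLZ_strategy_default))
		return false;			/* not worth it; log the full tuple instead */

	/* Decode using the old tuple as history, as redo would. */
	pglz_delta_decode(encoded, enc_len, decoded, sizeof(decoded), &dec_len,
					  old_data, old_len);

	return dec_len == (uint32) new_len &&
		memcmp(decoded, new_data, new_len) == 0;
}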
On Wednesday, March 13, 2013 5:50 PM Amit Kapila wrote:
On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll
begin with some numbers:
Based on your patch, I have tried some more optimizations:
Based on the numbers provided by Daniel for compression methods, I tried the Snappy
algorithm for encoding,
and it addresses most of your concerns, in that it should not degrade
performance for the majority of cases.
postgres original:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232916160 | 34.0338308811188
two short fields, one changed | 1232909704 | 32.8722319602966
two short fields, both changed | 1236770128 | 35.445415019989
one short and one long field, no change | 1053000144 | 23.2983899116516
ten tiny fields, all changed | 1397452584 | 40.2718069553375
hundred tiny fields, first 10 changed | 622082664 | 21.7642788887024
hundred tiny fields, all changed | 626461528 | 20.964781999588
hundred tiny fields, half changed | 621900472 | 21.6473519802094
hundred tiny fields, half nulled | 557714752 | 19.0088789463043
(9 rows)
postgres with wal encoded using snappy:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232915128 | 34.6910920143127
two short fields, one changed | 1238902520 | 34.2287850379944
two short fields, both changed | 1233882056 | 35.3292708396912
one short and one long field, no change | 733095168 | 20.3494939804077
ten tiny fields, all changed | 1314959744 | 38.969575881958
hundred tiny fields, first 10 changed | 483275136 | 19.6973309516907
hundred tiny fields, all changed | 481755280 | 19.7665288448334
hundred tiny fields, half changed | 488693616 | 19.7246761322021
hundred tiny fields, half nulled | 483425712 | 18.6299569606781
(9 rows)
The changes are to call snappy compress and decompress for encoding and decoding
in the patch.
I am doing encoding only for tuple lengths greater than 32, as for very small tuples
encoding might not make much sense.
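As a rough sketch of that gating logic (the 32-byte threshold is from this mail and the 4096-byte limit mirrors the pglz history window; the helper name and everything else is an assumption for illustration, not code from either patch):

#include "postgres.h"

#define WAL_DELTA_MIN_TUPLE_LEN		32		/* threshold mentioned above */
#define WAL_DELTA_MAX_HISTORY		4096	/* PGLZ_HISTORY_SIZE in pg_lzcompress.c */

/*
 * Hypothetical helper: decide whether an updated tuple is worth encoding.
 * Very small new tuples are skipped because the encoding overhead would
 * dominate; old tuples outside the history window cannot be referenced.
 */
static bool
wal_update_worth_encoding(uint32 old_len, uint32 new_len)
{
	if (new_len <= WAL_DELTA_MIN_TUPLE_LEN)
		return false;
	if (old_len < 4 || old_len >= WAL_DELTA_MAX_HISTORY)
		return false;
	return true;
}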
On my machine, while using snappy compress/decompress, it was giving stack
corruption for the first 4 bytes, so I put in the fix below to proceed.
I am still looking into the reason for it.
1. snappy_compress - Increment the encoded data buffer by 4 bytes before
compression starts.
2. snappy_uncompress - Decrement the 4 bytes incremented during compress.
3. snappy_uncompressed_length - Decrement the 4 bytes incremented during
compress.
For the LZ compression patch, there was a small problem in identifying the max length,
which I have corrected in the separate patch
'pglz-with-micro-optimizations-4.patch'.
In my opinion, there are the following ways forward for this patch:
1. Use LZ compression, but provide a way for the user to avoid it
for cases where not much compression is possible.
I see this as a viable way because most updates will change only a few
columns and the rest of the data would be the same.
2. Use the snappy APIs; does anyone know of a standard snappy library?
3. Provide multiple compression methods, so that depending on usage, the user can pick
the appropriate one.
Feedback?
With Regards,
Amit Kapila.
Attachments:
snappy_algo_v1.patch  application/octet-stream; name=snappy_algo_v1.patch
*** a/src/backend/utils/adt/Makefile
--- b/src/backend/utils/adt/Makefile
***************
*** 31,37 **** OBJS = acl.o arrayfuncs.o array_selfuncs.o array_typanalyze.o \
tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
tsvector.o tsvector_op.o tsvector_parser.o \
txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o \
! rangetypes_typanalyze.o rangetypes_selfuncs.o
like.o: like.c like_match.c
--- 31,37 ----
tsquery_op.o tsquery_rewrite.o tsquery_util.o tsrank.o \
tsvector.o tsvector_op.o tsvector_parser.o \
txid.o uuid.o windowfuncs.o xml.o rangetypes_spgist.o \
! rangetypes_typanalyze.o rangetypes_selfuncs.o snappy.o
like.o: like.c like_match.c
*** /dev/null
--- b/src/backend/utils/adt/snappy.c
***************
*** 0 ****
--- 1,1334 ----
+ /*
+ * C port of the snappy compressor from Google.
+ * This is a very fast compressor with comparable compression to lzo.
+ * Works best on 64bit little-endian, but should be good on others too.
+ * Ported by Andi Kleen.
+ * Based on snappy 1.0.3 plus some selected changes from SVN.
+ */
+
+ /*
+ * Copyright 2005 Google Inc. All Rights Reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+ #ifdef __KERNEL__
+ #include <linux/kernel.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/string.h>
+ #include <linux/snappy.h>
+ #include <linux/vmalloc.h>
+ #include <asm/unaligned.h>
+ #else
+ #include <stdbool.h>
+ #include <stddef.h>
+ #include "utils/snappy.h"
+ #include "utils/compat.h"
+ #endif
+
+ #define CRASH_UNLESS(x) BUG_ON(!(x))
+ #define CHECK(cond) CRASH_UNLESS(cond)
+ #define CHECK_LE(a, b) CRASH_UNLESS((a) <= (b))
+ #define CHECK_GE(a, b) CRASH_UNLESS((a) >= (b))
+ #define CHECK_EQ(a, b) CRASH_UNLESS((a) == (b))
+ #define CHECK_NE(a, b) CRASH_UNLESS((a) != (b))
+ #define CHECK_LT(a, b) CRASH_UNLESS((a) < (b))
+ #define CHECK_GT(a, b) CRASH_UNLESS((a) > (b))
+
+ #define UNALIGNED_LOAD16(_p) get_unaligned((u16 *)(_p))
+ #define UNALIGNED_LOAD32(_p) get_unaligned((u32 *)(_p))
+ #define UNALIGNED_LOAD64(_p) get_unaligned((u64 *)(_p))
+
+ #define UNALIGNED_STORE16(_p, _val) put_unaligned(_val, (u16 *)(_p))
+ #define UNALIGNED_STORE32(_p, _val) put_unaligned(_val, (u32 *)(_p))
+ #define UNALIGNED_STORE64(_p, _val) put_unaligned(_val, (u64 *)(_p))
+
+ #ifdef NDEBUG
+
+ #define DCHECK(cond) do {} while(0)
+ #define DCHECK_LE(a, b) do {} while(0)
+ #define DCHECK_GE(a, b) do {} while(0)
+ #define DCHECK_EQ(a, b) do {} while(0)
+ #define DCHECK_NE(a, b) do {} while(0)
+ #define DCHECK_LT(a, b) do {} while(0)
+ #define DCHECK_GT(a, b) do {} while(0)
+
+ #else
+
+ #define DCHECK(cond) CHECK(cond)
+ #define DCHECK_LE(a, b) CHECK_LE(a, b)
+ #define DCHECK_GE(a, b) CHECK_GE(a, b)
+ #define DCHECK_EQ(a, b) CHECK_EQ(a, b)
+ #define DCHECK_NE(a, b) CHECK_NE(a, b)
+ #define DCHECK_LT(a, b) CHECK_LT(a, b)
+ #define DCHECK_GT(a, b) CHECK_GT(a, b)
+
+ #endif
+
+ static inline bool is_little_endian(void)
+ {
+ #ifdef __LITTLE_ENDIAN__
+ return true;
+ #endif
+ return false;
+ }
+
+ static inline int log2_floor(u32 n)
+ {
+ return n == 0 ? -1 : 31 ^ __builtin_clz(n);
+ }
+
+ static inline int find_lsb_set_non_zero(u32 n)
+ {
+ return __builtin_ctz(n);
+ }
+
+ static inline int find_lsb_set_non_zero64(u64 n)
+ {
+ return __builtin_ctzll(n);
+ }
+
+ #define kmax32 5
+
+ /*
+ * Attempts to parse a varint32 from a prefix of the bytes in [ptr,limit-1].
+ * Never reads a character at or beyond limit. If a valid/terminated varint32
+ * was found in the range, stores it in *OUTPUT and returns a pointer just
+ * past the last byte of the varint32. Else returns NULL. On success,
+ * "result <= limit".
+ */
+ static inline const char *varint_parse32_with_limit(const char *p,
+ const char *l,
+ u32 * OUTPUT)
+ {
+ const unsigned char *ptr = (const unsigned char *)(p);
+ const unsigned char *limit = (const unsigned char *)(l);
+ u32 b, result;
+
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result = b & 127;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 7;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 14;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 21;
+ if (b < 128)
+ goto done;
+ if (ptr >= limit)
+ return NULL;
+ b = *(ptr++);
+ result |= (b & 127) << 28;
+ if (b < 16)
+ goto done;
+ return NULL; /* Value is too long to be a varint32 */
+ done:
+ *OUTPUT = result;
+ return (const char *)(ptr);
+ }
+
+ /*
+ * REQUIRES "ptr" points to a buffer of length sufficient to hold "v".
+ * EFFECTS Encodes "v" into "ptr" and returns a pointer to the
+ * byte just past the last encoded byte.
+ */
+ static inline char *varint_encode32(char *sptr, u32 v)
+ {
+ /* Operate on characters as unsigneds */
+ unsigned char *ptr = (unsigned char *)(sptr);
+ static const int B = 128;
+
+ if (v < (1 << 7)) {
+ *(ptr++) = v;
+ } else if (v < (1 << 14)) {
+ *(ptr++) = v | B;
+ *(ptr++) = v >> 7;
+ } else if (v < (1 << 21)) {
+ *(ptr++) = v | B;
+ *(ptr++) = (v >> 7) | B;
+ *(ptr++) = v >> 14;
+ } else if (v < (1 << 28)) {
+ *(ptr++) = v | B;
+ *(ptr++) = (v >> 7) | B;
+ *(ptr++) = (v >> 14) | B;
+ *(ptr++) = v >> 21;
+ } else {
+ *(ptr++) = v | B;
+ *(ptr++) = (v >> 7) | B;
+ *(ptr++) = (v >> 14) | B;
+ *(ptr++) = (v >> 21) | B;
+ *(ptr++) = v >> 28;
+ }
+ return (char *)(ptr);
+ }
+
+ struct source {
+ const char *ptr;
+ size_t left;
+ };
+
+ static inline int available(struct source *s)
+ {
+ return s->left;
+ }
+
+ static inline const char *peek(struct source *s, size_t * len)
+ {
+ *len = s->left;
+ return s->ptr;
+ }
+
+ static inline void skip(struct source *s, size_t n)
+ {
+ s->left -= n;
+ s->ptr += n;
+ }
+
+ struct sink {
+ char *dest;
+ };
+
+ static inline void append(struct sink *s, const char *data, size_t n)
+ {
+ if (data != s->dest)
+ memcpy(s->dest, data, n);
+ s->dest += n;
+ }
+
+ static inline void *sink_peek(struct sink *s, size_t n)
+ {
+ return s->dest;
+ }
+
+ struct writer {
+ char *base;
+ char *op;
+ char *op_limit;
+ };
+
+ /* Called before decompression */
+ static inline void writer_set_expected_length(struct writer *w, size_t len)
+ {
+ w->op_limit = w->op + len;
+ }
+
+ /* Called after decompression */
+ static inline bool writer_check_length(struct writer *w)
+ {
+ return w->op == w->op_limit;
+ }
+
+ /*
+ * Copy "len" bytes from "src" to "op", one byte at a time. Used for
+ * handling COPY operations where the input and output regions may
+ * overlap. For example, suppose:
+ * src == "ab"
+ * op == src + 2
+ * len == 20
+ * After IncrementalCopy(src, op, len), the result will have
+ * eleven copies of "ab"
+ * ababababababababababab
+ * Note that this does not match the semantics of either memcpy()
+ * or memmove().
+ */
+ static inline void incremental_copy(const char *src, char *op, int len)
+ {
+ DCHECK_GT(len, 0);
+ do {
+ *op++ = *src++;
+ } while (--len > 0);
+ }
+
+ /*
+ * Equivalent to IncrementalCopy except that it can write up to ten extra
+ * bytes after the end of the copy, and that it is faster.
+ *
+ * The main part of this loop is a simple copy of eight bytes at a time until
+ * we've copied (at least) the requested amount of bytes. However, if op and
+ * src are less than eight bytes apart (indicating a repeating pattern of
+ * length < 8), we first need to expand the pattern in order to get the correct
+ * results. For instance, if the buffer looks like this, with the eight-byte
+ * <src> and <op> patterns marked as intervals:
+ *
+ * abxxxxxxxxxxxx
+ * [------] src
+ * [------] op
+ *
+ * a single eight-byte copy from <src> to <op> will repeat the pattern once,
+ * after which we can move <op> two bytes without moving <src>:
+ *
+ * ababxxxxxxxxxx
+ * [------] src
+ * [------] op
+ *
+ * and repeat the exercise until the two no longer overlap.
+ *
+ * This allows us to do very well in the special case of one single byte
+ * repeated many times, without taking a big hit for more general cases.
+ *
+ * The worst case of extra writing past the end of the match occurs when
+ * op - src == 1 and len == 1; the last copy will read from byte positions
+ * [0..7] and write to [4..11], whereas it was only supposed to write to
+ * position 1. Thus, ten excess bytes.
+ */
+
+ #define kmax_increment_copy_overflow 10
+
+ static inline void incremental_copy_fast_path(const char *src, char *op,
+ int len)
+ {
+ while (op - src < 8) {
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(src));
+ len -= op - src;
+ op += op - src;
+ }
+ while (len > 0) {
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(src));
+ src += 8;
+ op += 8;
+ len -= 8;
+ }
+ }
+
+ static inline bool writer_append_from_self(struct writer *w, u32 offset,
+ u32 len)
+ {
+ char *op = w->op;
+ const int space_left = w->op_limit - op;
+
+ if (op - w->base <= offset - 1u) /* -1u catches offset==0 */
+ return false;
+ if (len <= 16 && offset >= 8 && space_left >= 16) {
+ /* Fast path, used for the majority (70-80%) of dynamic
+ * invocations. */
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(op - offset));
+ UNALIGNED_STORE64(op + 8, UNALIGNED_LOAD64(op - offset + 8));
+ } else {
+ if (space_left >= len + kmax_increment_copy_overflow) {
+ incremental_copy_fast_path(op - offset, op, len);
+ } else {
+ if (space_left < len) {
+ return false;
+ }
+ incremental_copy(op - offset, op, len);
+ }
+ }
+
+ w->op = op + len;
+ return true;
+ }
+
+ static inline bool writer_append(struct writer *w, const char *ip, u32 len)
+ {
+ char *op = w->op;
+ const int space_left = w->op_limit - op;
+ if (space_left < len)
+ return false;
+ memcpy(op, ip, len);
+ w->op = op + len;
+ return true;
+ }
+
+ static inline bool writer_try_fast_append(struct writer *w, const char *ip,
+ u32 available, u32 len)
+ {
+ char *op = w->op;
+ const int space_left = w->op_limit - op;
+ if (len <= 16 && available >= 16 && space_left >= 16) {
+ /* Fast path, used for the majority (~95%) of invocations */
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(ip));
+ UNALIGNED_STORE64(op + 8, UNALIGNED_LOAD64(ip + 8));
+ w->op = op + len;
+ return true;
+ }
+ return false;
+ }
+
+ /*
+ * Any hash function will produce a valid compressed bitstream, but a good
+ * hash function reduces the number of collisions and thus yields better
+ * compression for compressible input, and more speed for incompressible
+ * input. Of course, it doesn't hurt if the hash function is reasonably fast
+ * either, as it gets called a lot.
+ */
+ static inline u32 hash_bytes(u32 bytes, int shift)
+ {
+ u32 kmul = 0x1e35a7bd;
+ return (bytes * kmul) >> shift;
+ }
+
+ static inline u32 hash(const char *p, int shift)
+ {
+ return hash_bytes(UNALIGNED_LOAD32(p), shift);
+ }
+
+ /*
+ * Compressed data can be defined as:
+ * compressed := item* literal*
+ * item := literal* copy
+ *
+ * The trailing literal sequence has a space blowup of at most 62/60
+ * since a literal of length 60 needs one tag byte + one extra byte
+ * for length information.
+ *
+ * Item blowup is trickier to measure. Suppose the "copy" op copies
+ * 4 bytes of data. Because of a special check in the encoding code,
+ * we produce a 4-byte copy only if the offset is < 65536. Therefore
+ * the copy op takes 3 bytes to encode, and this type of item leads
+ * to at most the 62/60 blowup for representing literals.
+ *
+ * Suppose the "copy" op copies 5 bytes of data. If the offset is big
+ * enough, it will take 5 bytes to encode the copy op. Therefore the
+ * worst case here is a one-byte literal followed by a five-byte copy.
+ * I.e., 6 bytes of input turn into 7 bytes of "compressed" data.
+ *
+ * This last factor dominates the blowup, so the final estimate is:
+ */
+ size_t snappy_max_compressed_length(size_t source_len)
+ {
+ return 32 + source_len + source_len / 6;
+ }
+ EXPORT_SYMBOL(snappy_max_compressed_length);
+
+ enum {
+ LITERAL = 0,
+ COPY_1_BYTE_OFFSET = 1, /* 3 bit length + 3 bits of offset in opcode */
+ COPY_2_BYTE_OFFSET = 2,
+ COPY_4_BYTE_OFFSET = 3
+ };
+
+ static inline char *emit_literal(char *op,
+ const char *literal,
+ int len, bool allow_fast_path)
+ {
+ int n = len - 1; /* Zero-length literals are disallowed */
+
+ if (n < 60) {
+ /* Fits in tag byte */
+ *op++ = LITERAL | (n << 2);
+
+ /*
+ * The vast majority of copies are below 16 bytes, for which a
+ * call to memcpy is overkill. This fast path can sometimes
+ * copy up to 15 bytes too much, but that is okay in the
+ * main loop, since we have a bit to go on for both sides:
+ *
+ * - The input will always have kInputMarginBytes = 15 extra
+ * available bytes, as long as we're in the main loop, and
+ * if not, allow_fast_path = false.
+ * - The output will always have 32 spare bytes (see
+ * MaxCompressedLength).
+ */
+ if (allow_fast_path && len <= 16) {
+ UNALIGNED_STORE64(op, UNALIGNED_LOAD64(literal));
+ UNALIGNED_STORE64(op + 8,
+ UNALIGNED_LOAD64(literal + 8));
+ return op + len;
+ }
+ } else {
+ /* Encode in upcoming bytes */
+ char *base = op;
+ int count = 0;
+ op++;
+ while (n > 0) {
+ *op++ = n & 0xff;
+ n >>= 8;
+ count++;
+ }
+ DCHECK(count >= 1);
+ DCHECK(count <= 4);
+ *base = LITERAL | ((59 + count) << 2);
+ }
+ memcpy(op, literal, len);
+ return op + len;
+ }
+
+ static inline char *emit_copy_less_than64(char *op, int offset, int len)
+ {
+ DCHECK_LE(len, 64);
+ DCHECK_GE(len, 4);
+ DCHECK_LT(offset, 65536);
+
+ if ((len < 12) && (offset < 2048)) {
+ int len_minus_4 = len - 4;
+ DCHECK(len_minus_4 < 8); /* Must fit in 3 bits */
+ *op++ =
+ COPY_1_BYTE_OFFSET | ((len_minus_4) << 2) | ((offset >> 8)
+ << 5);
+ *op++ = offset & 0xff;
+ } else {
+ *op++ = COPY_2_BYTE_OFFSET | ((len - 1) << 2);
+ put_unaligned_le16(offset, op);
+ op += 2;
+ }
+ return op;
+ }
+
+ static inline char *emit_copy(char *op, int offset, int len)
+ {
+ /*
+ * Emit 64 byte copies but make sure to keep at least four bytes
+ * reserved
+ */
+ while (len >= 68) {
+ op = emit_copy_less_than64(op, offset, 64);
+ len -= 64;
+ }
+
+ /*
+ * Emit an extra 60 byte copy if have too much data to fit in
+ * one copy
+ */
+ if (len > 64) {
+ op = emit_copy_less_than64(op, offset, 60);
+ len -= 60;
+ }
+
+ /* Emit remainder */
+ op = emit_copy_less_than64(op, offset, len);
+ return op;
+ }
+
+ /**
+ * snappy_uncompressed_length - return length of uncompressed output.
+ * @start: compressed buffer
+ * @n: length of compressed buffer.
+ * @result: Write the length of the uncompressed output here.
+ *
+ * Returns true when successfull, otherwise false.
+ */
+ bool snappy_uncompressed_length(const char *start, size_t n, size_t * result)
+ {
+ u32 v = 0;
+
+ /* Temp fix: decrement by 4 bytes, because compress adds 4 extra bytes */
+ const char *limit = (start + 4) + (n - 4);
+ if (varint_parse32_with_limit(start, limit, &v) != NULL) {
+ *result = v;
+ return true;
+ } else {
+ return false;
+ }
+ }
+ EXPORT_SYMBOL(snappy_uncompressed_length);
+
+ #define kblock_log 15
+ #define kblock_size (1 << kblock_log)
+
+ /*
+ * This value could be halfed or quartered to save memory
+ * at the cost of slightly worse compression.
+ */
+ #define kmax_hash_table_bits 14
+ #define kmax_hash_table_size (1 << kmax_hash_table_bits)
+
+ /*
+ * Use smaller hash table when input.size() is smaller, since we
+ * fill the table, incurring O(hash table size) overhead for
+ * compression, and if the input is short, we won't need that
+ * many hash table entries anyway.
+ */
+ static u16 *get_hash_table(struct snappy_env *env, size_t input_size,
+ int *table_size)
+ {
+ int htsize = 256;
+
+ DCHECK(kmax_hash_table_size >= 256);
+ while (htsize < kmax_hash_table_size && htsize < input_size)
+ htsize <<= 1;
+ CHECK_EQ(0, htsize & (htsize - 1));
+ CHECK_LE(htsize, kmax_hash_table_size);
+
+ u16 *table;
+ table = env->hash_table;
+
+ *table_size = htsize;
+ memset(table, 0, htsize * sizeof(*table));
+ return table;
+ }
+
+ /*
+ * Return the largest n such that
+ *
+ * s1[0,n-1] == s2[0,n-1]
+ * and n <= (s2_limit - s2).
+ *
+ * Does not read *s2_limit or beyond.
+ * Does not read *(s1 + (s2_limit - s2)) or beyond.
+ * Requires that s2_limit >= s2.
+ *
+ * Separate implementation for x86_64, for speed. Uses the fact that
+ * x86_64 is little endian.
+ */
+ #if defined(__LITTLE_ENDIAN__) && BITS_PER_LONG == 64
+ static inline int find_match_length(const char *s1,
+ const char *s2, const char *s2_limit)
+ {
+ int matched = 0;
+
+ DCHECK_GE(s2_limit, s2);
+ /*
+ * Find out how long the match is. We loop over the data 64 bits at a
+ * time until we find a 64-bit block that doesn't match; then we find
+ * the first non-matching bit and use that to calculate the total
+ * length of the match.
+ */
+ while (likely(s2 <= s2_limit - 8)) {
+ if (unlikely
+ (UNALIGNED_LOAD64(s2) == UNALIGNED_LOAD64(s1 + matched))) {
+ s2 += 8;
+ matched += 8;
+ } else {
+ /*
+ * On current (mid-2008) Opteron models there
+ * is a 3% more efficient code sequence to
+ * find the first non-matching byte. However,
+ * what follows is ~10% better on Intel Core 2
+ * and newer, and we expect AMD's bsf
+ * instruction to improve.
+ */
+ u64 x =
+ UNALIGNED_LOAD64(s2) ^ UNALIGNED_LOAD64(s1 +
+ matched);
+ int matching_bits = find_lsb_set_non_zero64(x);
+ matched += matching_bits >> 3;
+ return matched;
+ }
+ }
+ while (likely(s2 < s2_limit)) {
+ if (likely(s1[matched] == *s2)) {
+ ++s2;
+ ++matched;
+ } else {
+ return matched;
+ }
+ }
+ return matched;
+ }
+ #else
+ static inline int find_match_length(const char *s1,
+ const char *s2, const char *s2_limit)
+ {
+ /* Implementation based on the x86-64 version, above. */
+ DCHECK_GE(s2_limit, s2);
+ int matched = 0;
+
+ while (s2 <= s2_limit - 4 &&
+ UNALIGNED_LOAD32(s2) == UNALIGNED_LOAD32(s1 + matched)) {
+ s2 += 4;
+ matched += 4;
+ }
+ if (is_little_endian() && s2 <= s2_limit - 4) {
+ u32 x =
+ UNALIGNED_LOAD32(s2) ^ UNALIGNED_LOAD32(s1 + matched);
+ int matching_bits = find_lsb_set_non_zero(x);
+ matched += matching_bits >> 3;
+ } else {
+ while ((s2 < s2_limit) && (s1[matched] == *s2)) {
+ ++s2;
+ ++matched;
+ }
+ }
+ return matched;
+ }
+ #endif
+
+ /*
+ * For 0 <= offset <= 4, GetU32AtOffset(UNALIGNED_LOAD64(p), offset) will
+ * equal UNALIGNED_LOAD32(p + offset). Motivation: On x86-64 hardware we have
+ * empirically found that overlapping loads such as
+ * UNALIGNED_LOAD32(p) ... UNALIGNED_LOAD32(p+1) ... UNALIGNED_LOAD32(p+2)
+ * are slower than UNALIGNED_LOAD64(p) followed by shifts and casts to u32.
+ */
+ static inline u32 get_u32_at_offset(u64 v, int offset)
+ {
+ DCHECK(0 <= offset && offset <= 4);
+ return v >> (is_little_endian()? 8 * offset : 32 - 8 * offset);
+ }
+
+ /*
+ * Flat array compression that does not emit the "uncompressed length"
+ * prefix. Compresses "input" string to the "*op" buffer.
+ *
+ * REQUIRES: "input" is at most "kBlockSize" bytes long.
+ * REQUIRES: "op" points to an array of memory that is at least
+ * "MaxCompressedLength(input.size())" in size.
+ * REQUIRES: All elements in "table[0..table_size-1]" are initialized to zero.
+ * REQUIRES: "table_size" is a power of two
+ *
+ * Returns an "end" pointer into "op" buffer.
+ * "end - op" is the compressed size of "input".
+ */
+
+ static char *compress_fragment(const char *const input,
+ const size_t input_size,
+ char *op, u16 * table, const int table_size)
+ {
+ /* "ip" is the input pointer, and "op" is the output pointer. */
+ const char *ip = input;
+ CHECK_LE(input_size, kblock_size);
+ CHECK_EQ(table_size & (table_size - 1), 0);
+ const int shift = 32 - log2_floor(table_size);
+ DCHECK_EQ(UINT_MAX >> shift, table_size - 1);
+ const char *ip_end = input + input_size;
+ const char *baseip = ip;
+ /*
+ * Bytes in [next_emit, ip) will be emitted as literal bytes. Or
+ * [next_emit, ip_end) after the main loop.
+ */
+ const char *next_emit = ip;
+
+ const int kinput_margin_bytes = 15;
+
+ if (likely(input_size >= kinput_margin_bytes)) {
+ const char *ip_limit = input + input_size -
+ kinput_margin_bytes;
+
+ u32 next_hash;
+ for (next_hash = hash(++ip, shift);;) {
+ DCHECK_LT(next_emit, ip);
+ /*
+ * The body of this loop calls EmitLiteral once and then EmitCopy one or
+ * more times. (The exception is that when we're close to exhausting
+ * the input we goto emit_remainder.)
+ *
+ * In the first iteration of this loop we're just starting, so
+ * there's nothing to copy, so calling EmitLiteral once is
+ * necessary. And we only start a new iteration when the
+ * current iteration has determined that a call to EmitLiteral will
+ * precede the next call to EmitCopy (if any).
+ *
+ * Step 1: Scan forward in the input looking for a 4-byte-long match.
+ * If we get close to exhausting the input then goto emit_remainder.
+ *
+ * Heuristic match skipping: If 32 bytes are scanned with no matches
+ * found, start looking only at every other byte. If 32 more bytes are
+ * scanned, look at every third byte, etc.. When a match is found,
+ * immediately go back to looking at every byte. This is a small loss
+ * (~5% performance, ~0.1% density) for lcompressible data due to more
+ * bookkeeping, but for non-compressible data (such as JPEG) it's a huge
+ * win since the compressor quickly "realizes" the data is incompressible
+ * and doesn't bother looking for matches everywhere.
+ *
+ * The "skip" variable keeps track of how many bytes there are since the
+ * last match; dividing it by 32 (ie. right-shifting by five) gives the
+ * number of bytes to move ahead for each iteration.
+ */
+ u32 skip = 32;
+
+ const char *next_ip = ip;
+ const char *candidate;
+ do {
+ ip = next_ip;
+ u32 hval = next_hash;
+ DCHECK_EQ(hval, hash(ip, shift));
+ u32 bytes_between_hash_lookups = skip++ >> 5;
+ next_ip = ip + bytes_between_hash_lookups;
+ if (unlikely(next_ip > ip_limit)) {
+ goto emit_remainder;
+ }
+ next_hash = hash(next_ip, shift);
+ candidate = baseip + table[hval];
+ DCHECK_GE(candidate, baseip);
+ DCHECK_LT(candidate, ip);
+
+ table[hval] = ip - baseip;
+ } while (likely(UNALIGNED_LOAD32(ip) !=
+ UNALIGNED_LOAD32(candidate)));
+
+ /*
+ * Step 2: A 4-byte match has been found. We'll later see if more
+ * than 4 bytes match. But, prior to the match, input
+ * bytes [next_emit, ip) are unmatched. Emit them as "literal bytes."
+ */
+ DCHECK_LE(next_emit + 16, ip_end);
+ op = emit_literal(op, next_emit, ip - next_emit, true);
+
+ /*
+ * Step 3: Call EmitCopy, and then see if another EmitCopy could
+ * be our next move. Repeat until we find no match for the
+ * input immediately after what was consumed by the last EmitCopy call.
+ *
+ * If we exit this loop normally then we need to call EmitLiteral next,
+ * though we don't yet know how big the literal will be. We handle that
+ * by proceeding to the next iteration of the main loop. We also can exit
+ * this loop via goto if we get close to exhausting the input.
+ */
+ u64 input_bytes = 0;
+ u32 candidate_bytes = 0;
+
+ do {
+ /*
+ * We have a 4-byte match at ip, and no need to emit any
+ * "literal bytes" prior to ip.
+ */
+ const char *base = ip;
+ int matched = 4 +
+ find_match_length(candidate + 4, ip + 4,
+ ip_end);
+ ip += matched;
+ int offset = base - candidate;
+ DCHECK_EQ(0, memcmp(base, candidate, matched));
+ op = emit_copy(op, offset, matched);
+ /*
+ * We could immediately start working at ip now, but to improve
+ * compression we first update table[Hash(ip - 1, ...)].
+ */
+ const char *insert_tail = ip - 1;
+ next_emit = ip;
+ if (unlikely(ip >= ip_limit)) {
+ goto emit_remainder;
+ }
+ input_bytes = UNALIGNED_LOAD64(insert_tail);
+ u32 prev_hash =
+ hash_bytes(get_u32_at_offset
+ (input_bytes, 0), shift);
+ table[prev_hash] = ip - baseip - 1;
+ u32 cur_hash =
+ hash_bytes(get_u32_at_offset
+ (input_bytes, 1), shift);
+ candidate = baseip + table[cur_hash];
+ candidate_bytes = UNALIGNED_LOAD32(candidate);
+ table[cur_hash] = ip - baseip;
+ } while (get_u32_at_offset(input_bytes, 1) ==
+ candidate_bytes);
+
+ next_hash =
+ hash_bytes(get_u32_at_offset(input_bytes, 2),
+ shift);
+ ++ip;
+ }
+ }
+
+ emit_remainder:
+ /* Emit the remaining bytes as a literal */
+ if (next_emit < ip_end)
+ op = emit_literal(op, next_emit, ip_end - next_emit, false);
+
+ return op;
+ }
+
+ /*
+ * -----------------------------------------------------------------------
+ * Lookup table for decompression code. Generated by ComputeTable() below.
+ * -----------------------------------------------------------------------
+ */
+
+ /* Mapping from i in range [0,4] to a mask to extract the bottom 8*i bits */
+ static const u32 wordmask[] = {
+ 0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu
+ };
+
+ /*
+ * Data stored per entry in lookup table:
+ * Range Bits-used Description
+ * ------------------------------------
+ * 1..64 0..7 Literal/copy length encoded in opcode byte
+ * 0..7 8..10 Copy offset encoded in opcode byte / 256
+ * 0..4 11..13 Extra bytes after opcode
+ *
+ * We use eight bits for the length even though 7 would have sufficed
+ * because of efficiency reasons:
+ * (1) Extracting a byte is faster than a bit-field
+ * (2) It properly aligns copy offset so we do not need a <<8
+ */
+ static const u16 char_table[256] = {
+ 0x0001, 0x0804, 0x1001, 0x2001, 0x0002, 0x0805, 0x1002, 0x2002,
+ 0x0003, 0x0806, 0x1003, 0x2003, 0x0004, 0x0807, 0x1004, 0x2004,
+ 0x0005, 0x0808, 0x1005, 0x2005, 0x0006, 0x0809, 0x1006, 0x2006,
+ 0x0007, 0x080a, 0x1007, 0x2007, 0x0008, 0x080b, 0x1008, 0x2008,
+ 0x0009, 0x0904, 0x1009, 0x2009, 0x000a, 0x0905, 0x100a, 0x200a,
+ 0x000b, 0x0906, 0x100b, 0x200b, 0x000c, 0x0907, 0x100c, 0x200c,
+ 0x000d, 0x0908, 0x100d, 0x200d, 0x000e, 0x0909, 0x100e, 0x200e,
+ 0x000f, 0x090a, 0x100f, 0x200f, 0x0010, 0x090b, 0x1010, 0x2010,
+ 0x0011, 0x0a04, 0x1011, 0x2011, 0x0012, 0x0a05, 0x1012, 0x2012,
+ 0x0013, 0x0a06, 0x1013, 0x2013, 0x0014, 0x0a07, 0x1014, 0x2014,
+ 0x0015, 0x0a08, 0x1015, 0x2015, 0x0016, 0x0a09, 0x1016, 0x2016,
+ 0x0017, 0x0a0a, 0x1017, 0x2017, 0x0018, 0x0a0b, 0x1018, 0x2018,
+ 0x0019, 0x0b04, 0x1019, 0x2019, 0x001a, 0x0b05, 0x101a, 0x201a,
+ 0x001b, 0x0b06, 0x101b, 0x201b, 0x001c, 0x0b07, 0x101c, 0x201c,
+ 0x001d, 0x0b08, 0x101d, 0x201d, 0x001e, 0x0b09, 0x101e, 0x201e,
+ 0x001f, 0x0b0a, 0x101f, 0x201f, 0x0020, 0x0b0b, 0x1020, 0x2020,
+ 0x0021, 0x0c04, 0x1021, 0x2021, 0x0022, 0x0c05, 0x1022, 0x2022,
+ 0x0023, 0x0c06, 0x1023, 0x2023, 0x0024, 0x0c07, 0x1024, 0x2024,
+ 0x0025, 0x0c08, 0x1025, 0x2025, 0x0026, 0x0c09, 0x1026, 0x2026,
+ 0x0027, 0x0c0a, 0x1027, 0x2027, 0x0028, 0x0c0b, 0x1028, 0x2028,
+ 0x0029, 0x0d04, 0x1029, 0x2029, 0x002a, 0x0d05, 0x102a, 0x202a,
+ 0x002b, 0x0d06, 0x102b, 0x202b, 0x002c, 0x0d07, 0x102c, 0x202c,
+ 0x002d, 0x0d08, 0x102d, 0x202d, 0x002e, 0x0d09, 0x102e, 0x202e,
+ 0x002f, 0x0d0a, 0x102f, 0x202f, 0x0030, 0x0d0b, 0x1030, 0x2030,
+ 0x0031, 0x0e04, 0x1031, 0x2031, 0x0032, 0x0e05, 0x1032, 0x2032,
+ 0x0033, 0x0e06, 0x1033, 0x2033, 0x0034, 0x0e07, 0x1034, 0x2034,
+ 0x0035, 0x0e08, 0x1035, 0x2035, 0x0036, 0x0e09, 0x1036, 0x2036,
+ 0x0037, 0x0e0a, 0x1037, 0x2037, 0x0038, 0x0e0b, 0x1038, 0x2038,
+ 0x0039, 0x0f04, 0x1039, 0x2039, 0x003a, 0x0f05, 0x103a, 0x203a,
+ 0x003b, 0x0f06, 0x103b, 0x203b, 0x003c, 0x0f07, 0x103c, 0x203c,
+ 0x0801, 0x0f08, 0x103d, 0x203d, 0x1001, 0x0f09, 0x103e, 0x203e,
+ 0x1801, 0x0f0a, 0x103f, 0x203f, 0x2001, 0x0f0b, 0x1040, 0x2040
+ };
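+
+ /*
+ * Worked example (editorial illustration, not part of the original patch):
+ * for tag byte 0xf1, char_table[0xf1] == 0x0f08, which the decoder below
+ * unpacks as
+ * length = entry & 0xff = 8
+ * copy offset = entry & 0x700 = 0x700 (bits 8..10, already shifted)
+ * extra bytes = entry >> 11 = 1 (one more offset byte follows the tag)
+ * i.e. an 8-byte copy whose offset is 0x700 plus the next input byte.
+ */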
+
+ struct snappy_decompressor {
+ struct source *reader; /* Underlying source of bytes to decompress */
+ const char *ip; /* Points to next buffered byte */
+ const char *ip_limit; /* Points just past buffered bytes */
+ u32 peeked; /* Bytes peeked from reader (need to skip) */
+ bool eof; /* Hit end of input without an error? */
+ char scratch[5]; /* Temporary buffer for peekfast boundaries */
+ };
+
+ static void
+ init_snappy_decompressor(struct snappy_decompressor *d, struct source *reader)
+ {
+ d->reader = reader;
+ d->ip = NULL;
+ d->ip_limit = NULL;
+ d->peeked = 0;
+ d->eof = false;
+ }
+
+ static void exit_snappy_decompressor(struct snappy_decompressor *d)
+ {
+ skip(d->reader, d->peeked);
+ }
+
+ /*
+ * Read the uncompressed length stored at the start of the compressed data.
+ * On success, stores the length in *result and returns true.
+ * On failure, returns false.
+ */
+ static bool read_uncompressed_length(struct snappy_decompressor *d,
+ u32 * result)
+ {
+ DCHECK(d->ip == NULL); /* Must not have read anything yet */
+
+ /* Length is encoded in 1..5 bytes */
+ *result = 0;
+ u32 shift = 0;
+ while (true) {
+ if (shift >= 32)
+ return false;
+ size_t n;
+ const char *ip = peek(d->reader, &n);
+ if (n == 0)
+ return false;
+ const unsigned char c = *(const unsigned char *)(ip);
+ skip(d->reader, 1);
+ *result |= (u32) (c & 0x7f) << shift;
+ if (c < 128) {
+ break;
+ }
+ shift += 7;
+ }
+ return true;
+ }
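+
+ /*
+ * Worked example (editorial illustration, not part of the original patch):
+ * the length prefix is a little-endian base-128 varint. An uncompressed
+ * length of 300 (0x12c) is stored as the two bytes 0xac 0x02:
+ * (0xac & 0x7f) = 44 from the first byte, (0x02 << 7) = 256 from the
+ * second, and 44 + 256 = 300; the high bit of 0xac marks a continuation
+ * byte, while 0x02 < 128 ends the loop above.
+ */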
+
+ static bool refill_tag(struct snappy_decompressor *d);
+
+ /*
+ * Process all items found in the input.
+ * Returns when the input is exhausted or an error is encountered; the
+ * caller checks d->eof afterwards to distinguish the two cases.
+ */
+ static void decompress_all_tags(struct snappy_decompressor *d,
+ struct writer *writer)
+ {
+ const char *ip = d->ip;
+
+ /*
+ * We could have put this refill fragment only at the beginning of the loop.
+ * However, duplicating it at the end of each branch gives the compiler more
+ * scope to optimize the <ip_limit_ - ip> expression based on the local
+ * context, which overall increases speed.
+ */
+ #define MAYBE_REFILL() \
+ if (d->ip_limit - ip < 5) { \
+ d->ip = ip; \
+ if (!refill_tag(d)) return; \
+ ip = d->ip; \
+ }
+
+
+ MAYBE_REFILL();
+ for (;;) {
+ if (d->ip_limit - ip < 5) {
+ d->ip = ip;
+ if (!refill_tag(d))
+ return;
+ ip = d->ip;
+ }
+
+ const unsigned char c = *(const unsigned char *)(ip++);
+
+ if ((c & 0x3) == LITERAL) {
+ u32 literal_length = (c >> 2) + 1;
+ if (writer_try_fast_append(writer, ip, d->ip_limit - ip,
+ literal_length)) {
+ DCHECK_LT(literal_length, 61);
+ ip += literal_length;
+ MAYBE_REFILL();
+ continue;
+ }
+ if (unlikely(literal_length >= 61)) {
+ /* Long literal */
+ const u32 literal_ll = literal_length - 60;
+ literal_length = (get_unaligned_le32(ip) &
+ wordmask[literal_ll]) + 1;
+ ip += literal_ll;
+ }
+
+ u32 avail = d->ip_limit - ip;
+ while (avail < literal_length) {
+ if (!writer_append(writer, ip, avail))
+ return;
+ literal_length -= avail;
+ skip(d->reader, d->peeked);
+ size_t n;
+ ip = peek(d->reader, &n);
+ avail = n;
+ d->peeked = avail;
+ if (avail == 0)
+ return; /* Premature end of input */
+ d->ip_limit = ip + avail;
+ }
+ if (!writer_append(writer, ip, literal_length))
+ return;
+ ip += literal_length;
+ MAYBE_REFILL();
+ } else {
+ const u32 entry = char_table[c];
+ const u32 trailer = get_unaligned_le32(ip) &
+ wordmask[entry >> 11];
+ const u32 length = entry & 0xff;
+ ip += entry >> 11;
+
+ /*
+ * copy_offset/256 is encoded in bits 8..10.
+ * By just fetching those bits, we get
+ * copy_offset (since the bit-field starts at
+ * bit 8).
+ */
+ const u32 copy_offset = entry & 0x700;
+ if (!writer_append_from_self(writer,
+ copy_offset + trailer,
+ length))
+ return;
+ MAYBE_REFILL();
+ }
+ }
+ }
+
+ #undef MAYBE_REFILL
+
+ static bool refill_tag(struct snappy_decompressor *d)
+ {
+ const char *ip = d->ip;
+
+ if (ip == d->ip_limit) {
+ size_t n;
+ /* Fetch a new fragment from the reader */
+ skip(d->reader, d->peeked); /* All peeked bytes are used up */
+ ip = peek(d->reader, &n);
+ d->peeked = n;
+ if (n == 0) {
+ d->eof = true;
+ return false;
+ }
+ d->ip_limit = ip + n;
+ }
+
+ /* Read the tag character */
+ DCHECK_LT(ip, d->ip_limit);
+ const unsigned char c = *(const unsigned char *)(ip);
+ const u32 entry = char_table[c];
+ const u32 needed = (entry >> 11) + 1; /* +1 byte for 'c' */
+ DCHECK_LE(needed, sizeof(d->scratch));
+
+ /* Read more bytes from reader if needed */
+ u32 nbuf = d->ip_limit - ip;
+
+ if (nbuf < needed) {
+ /*
+ * Stitch together bytes from ip and reader to form the word
+ * contents. We store the needed bytes in "scratch". They
+ * will be consumed immediately by the caller since we do not
+ * read more than we need.
+ */
+ memmove(d->scratch, ip, nbuf);
+ skip(d->reader, d->peeked); /* All peeked bytes are used up */
+ d->peeked = 0;
+ while (nbuf < needed) {
+ size_t length;
+ const char *src = peek(d->reader, &length);
+ if (length == 0)
+ return false;
+ u32 to_add = min_t(u32, needed - nbuf, length);
+ memcpy(d->scratch + nbuf, src, to_add);
+ nbuf += to_add;
+ skip(d->reader, to_add);
+ }
+ DCHECK_EQ(nbuf, needed);
+ d->ip = d->scratch;
+ d->ip_limit = d->scratch + needed;
+ } else if (nbuf < 5) {
+ /*
+ * Have enough bytes, but move into scratch so that we do not
+ * read past end of input
+ */
+ memmove(d->scratch, ip, nbuf);
+ skip(d->reader, d->peeked); /* All peeked bytes are used up */
+ d->peeked = 0;
+ d->ip = d->scratch;
+ d->ip_limit = d->scratch + nbuf;
+ } else {
+ /* Pass pointer to buffer returned by reader. */
+ d->ip = ip;
+ }
+ return true;
+ }
+
+ static int internal_uncompress(struct source *r,
+ struct writer *writer, u32 max_len)
+ {
+ struct snappy_decompressor decompressor;
+ u32 uncompressed_len = 0;
+
+ init_snappy_decompressor(&decompressor, r);
+
+ if (!read_uncompressed_length(&decompressor, &uncompressed_len))
+ return -EIO;
+ /* Protect against possible DoS attack */
+ if ((u64) (uncompressed_len) > max_len)
+ return -EIO;
+
+ writer_set_expected_length(writer, uncompressed_len);
+
+ /* Process the entire input */
+ decompress_all_tags(&decompressor, writer);
+
+ exit_snappy_decompressor(&decompressor);
+ return (decompressor.eof && writer_check_length(writer)) ? 0 : -EIO;
+ }
+
+ static inline int compress(struct snappy_env *env, struct source *reader,
+ struct sink *writer)
+ {
+ int err;
+ size_t written = 0;
+ int N = available(reader);
+ char ulength[kmax32];
+ char *p = varint_encode32(ulength, N);
+
+ append(writer, ulength, p - ulength);
+ written += (p - ulength);
+
+ while (N > 0) {
+ /* Get next block to compress (without copying if possible) */
+ size_t fragment_size;
+ const char *fragment = peek(reader, &fragment_size);
+ if (fragment_size == 0) {
+ err = -EIO;
+ goto out;
+ }
+ const int num_to_read = min_t(int, N, kblock_size);
+ size_t bytes_read = fragment_size;
+
+ int pending_advance = 0;
+ if (bytes_read >= num_to_read) {
+ /* Buffer returned by reader is large enough */
+ pending_advance = num_to_read;
+ fragment_size = num_to_read;
+ }
+ else {
+ memcpy(env->scratch, fragment, bytes_read);
+ skip(reader, bytes_read);
+
+ while (bytes_read < num_to_read) {
+ fragment = peek(reader, &fragment_size);
+ size_t n =
+ min_t(size_t, fragment_size,
+ num_to_read - bytes_read);
+ memcpy(env->scratch + bytes_read, fragment, n);
+ bytes_read += n;
+ skip(reader, n);
+ }
+ DCHECK_EQ(bytes_read, num_to_read);
+ fragment = env->scratch;
+ fragment_size = num_to_read;
+ }
+ if (fragment_size < num_to_read)
+ return -EIO;
+
+ /* Get encoding table for compression */
+ int table_size;
+ u16 *table = get_hash_table(env, num_to_read, &table_size);
+
+ /* Compress input_fragment and append to dest */
+ const int max_output =
+ snappy_max_compressed_length(num_to_read);
+
+ char *dest;
+ dest = sink_peek(writer, max_output);
+ if (!dest) {
+ /*
+ * Need a scratch buffer for the output,
+ * because the byte sink doesn't have enough
+ * in one piece.
+ */
+ dest = env->scratch_output;
+ }
+ char *end = compress_fragment(fragment, fragment_size,
+ dest, table, table_size);
+ append(writer, dest, end - dest);
+ written += (end - dest);
+
+ N -= num_to_read;
+ skip(reader, pending_advance);
+ }
+
+ err = 0;
+ out:
+ return err;
+ }
+
+
+ /**
+ * snappy_compress - Compress a buffer using the snappy compressor.
+ * @env: Preallocated environment
+ * @input: Input buffer
+ * @input_length: Length of input_buffer
+ * @compressed: Output buffer for compressed data
+ * @compressed_length: The real length of the output written here.
+ *
+ * Return 0 on success, otherwise a negative error code.
+ *
+ * The output buffer must be at least
+ * snappy_max_compressed_length(input_length) bytes long.
+ *
+ * Requires a preallocated environment from snappy_init_env.
+ * The environment does not keep state over individual calls
+ * of this function, just preallocates the memory.
+ */
+ int snappy_compress(struct snappy_env *env,
+ const char *input,
+ size_t input_length,
+ char *compressed, size_t *compressed_length)
+ {
+ int err;
+ struct source reader = {
+ .ptr = input,
+ .left = input_length
+ };
+ struct sink writer = {
+ .dest = compressed
+ };
+
+ /* Temp fix: reserve the first 4 bytes of the output (skipped again in snappy_uncompress) */
+ writer.dest += 4;
+ err = compress(env, &reader, &writer);
+
+ /* Compute how many bytes were added */
+ *compressed_length = (writer.dest - compressed);
+ return err;
+ }
+ EXPORT_SYMBOL(snappy_compress);
+
+ /**
+ * snappy_uncompress - Uncompress a snappy compressed buffer
+ * @compressed: Input buffer with compressed data
+ * @n: length of compressed buffer
+ * @uncompressed: buffer for uncompressed data
+ *
+ * The uncompressed data buffer must be at least
+ * snappy_uncompressed_length(compressed) bytes long.
+ *
+ * Return 0 on success, otherwise a negative error code.
+ */
+ int snappy_uncompress(const char *compressed, size_t n, char *uncompressed)
+ {
+ /* Temp fix: skip 4 bytes, because snappy_compress() reserves 4 extra bytes at the start */
+ struct source reader = {
+ .ptr = compressed + 4,
+ .left = n - 4
+ };
+ struct writer output = {
+ .base = uncompressed,
+ .op = uncompressed
+ };
+ return internal_uncompress(&reader, &output, 0xffffffff);
+ }
+ EXPORT_SYMBOL(snappy_uncompress);
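+
+ /*
+ * Editorial usage sketch, not part of the original patch: a minimal
+ * compress/uncompress round trip through the two entry points above.
+ * It assumes the prototypes from utils/snappy.h and the vmalloc/vfree
+ * shims from utils/compat.h are in scope, and that the caller already
+ * knows the uncompressed size.
+ */
+ static int example_round_trip(const char *src, size_t srclen, char *dst)
+ {
+ struct snappy_env env;
+ size_t complen = 0;
+ int err;
+ /* 4 extra bytes for the reserved length prefix noted in snappy_compress() */
+ char *comp = vmalloc(snappy_max_compressed_length(srclen) + 4);
+
+ if (!comp)
+ return -ENOMEM;
+ err = snappy_init_env(&env);
+ if (!err) {
+ err = snappy_compress(&env, src, srclen, comp, &complen);
+ snappy_free_env(&env);
+ }
+ if (!err)
+ err = snappy_uncompress(comp, complen, dst); /* dst must hold srclen bytes */
+ vfree(comp);
+ return err;
+ }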
+
+
+ /**
+ * snappy_init_env - Allocate snappy compression environment
+ * @env: Environment to preallocate
+ *
+ * Passing multiple entries in an iovec is not allowed
+ * on the environment allocated here.
+ * Returns 0 on success, otherwise negative errno.
+ * Must run in process context.
+ */
+ int snappy_init_env(struct snappy_env *env)
+ {
+ env->hash_table = vmalloc(sizeof(u16) * kmax_hash_table_size);
+ if (!env->hash_table)
+ return -ENOMEM;
+ return 0;
+ }
+ EXPORT_SYMBOL(snappy_init_env);
+
+ /**
+ * snappy_free_env - Free a snappy compression environment
+ * @env: Environment to free.
+ *
+ * Must run in process context.
+ */
+ void snappy_free_env(struct snappy_env *env)
+ {
+ vfree(env->hash_table);
+ #ifdef SG
+ vfree(env->scratch);
+ vfree(env->scratch_output);
+ #endif
+ memset(env, 0, sizeof(struct snappy_env));
+ }
+ EXPORT_SYMBOL(snappy_free_env);
*** /dev/null
--- b/src/include/utils/compat.h
***************
*** 0 ****
--- 1,39 ----
+
+ #include <stdlib.h>
+ #include <assert.h>
+ #include <string.h>
+ #include <errno.h>
+ #include <stdbool.h>
+ #include <limits.h>
+ #include <sys/uio.h>
+
+ typedef unsigned char u8;
+ typedef unsigned short u16;
+ typedef unsigned u32;
+ typedef unsigned long long u64;
+
+ #define BUG_ON(x) assert(!(x))
+
+ #define get_unaligned(x) (*(x))
+ #define get_unaligned_le32(x) (le32toh(*(u32 *)(x)))
+ #define put_unaligned(v,x) (*(x) = (v))
+ #define put_unaligned_le16(v,x) (*(u16 *)(x) = htole16(v))
+
+ #define vmalloc(x) malloc(x)
+ #define vfree(x) free(x)
+
+ #define EXPORT_SYMBOL(x)
+
+ #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+ #define likely(x) __builtin_expect((x), 1)
+ #define unlikely(x) __builtin_expect((x), 0)
+
+ #define min_t(t,x,y) ((x) < (y) ? (x) : (y))
+ #define max_t(t,x,y) ((x) > (y) ? (x) : (y))
+
+ #if __BYTE_ORDER == __LITTLE_ENDIAN
+ #define __LITTLE_ENDIAN__ 1
+ #endif
+
+ #define BITS_PER_LONG (__SIZEOF_LONG__ * 8)
*** /dev/null
--- b/src/include/utils/snappy.h
***************
*** 0 ****
--- 1,36 ----
+ #ifndef _LINUX_SNAPPY_H
+ #define _LINUX_SNAPPY_H
+
+
+ /* Only needed for compression. This preallocates the worst case */
+ struct snappy_env {
+ unsigned short *hash_table;
+ void *scratch;
+ void *scratch_output;
+ };
+
+ struct iovec;
+ int snappy_init_env(struct snappy_env *env);
+ int snappy_init_env_sg(struct snappy_env *env, bool sg);
+ void snappy_free_env(struct snappy_env *env);
+ int snappy_uncompress_iov(struct iovec *iov_in, int iov_in_len,
+ size_t input_len, char *uncompressed);
+ int snappy_uncompress(const char *compressed, size_t n, char *uncompressed);
+ int snappy_compress(struct snappy_env *env,
+ const char *input,
+ size_t input_length,
+ char *compressed,
+ size_t *compressed_length);
+ int snappy_compress_iov(struct snappy_env *env,
+ struct iovec *iov_in,
+ int iov_in_len,
+ size_t input_length,
+ struct iovec *iov_out,
+ int iov_out_len,
+ size_t *compressed_length);
+ bool snappy_uncompressed_length(const char *buf, size_t len, size_t *result);
+ size_t snappy_max_compressed_length(size_t source_len);
+
+
+
+ #endif
wal_update_snappy_concat_oldandnew_tuple_v1.patchapplication/octet-stream; name=wal_update_snappy_concat_oldandnew_tuple_v1.patchDownload
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,66 ****
--- 60,69 ----
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+ #include "utils/snappy.h"
+ /* GUC variable for EWT compression ratio */
+ int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
***************
*** 617,622 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 620,679 ----
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+ /* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuple versions by concatenating the
+ * old and new tuple data and compressing the result with snappy. The
+ * encoded result is stored in *encdata, its length in *enclen; the
+ * caller must provide a buffer large enough for the compressed output.
+ * ----------------
+ */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+ {
+ struct snappy_env env;
+ int err;
+ char *oldtupdata = (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ int32 oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ char *newtupdata = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
+ int32 newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ char buf[2 * MaxHeapTupleSize];
+
+ memcpy(buf, oldtupdata, oldtuplen);
+ memcpy(buf + oldtuplen, newtupdata, newtuplen);
+
+ err = snappy_init_env(&env);
+ if (err)
+ return false;
+
+ err = snappy_compress(&env, buf, oldtuplen + newtuplen, encdata, (size_t *)enclen);
+ snappy_free_env(&env);
+ if (err)
+ return false;
+
+ return true;
+ }
+
+ /* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple from its delta-encoded WAL tuple (EWT) and the old tuple version.
+ * ----------------
+ */
+ void
+ heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+ {
+ char buf[2 * MaxHeapTupleSize];
+ int32 oldtuplen = oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ snappy_uncompressed_length(encdata, enclen, (size_t *)&newtup->t_len);
+
+ snappy_uncompress(encdata, enclen, buf);
+ memcpy((char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ buf + oldtuplen, newtup->t_len - oldtuplen);
+ }
+
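+ /*
+ * Usage sketch (editorial illustration, not part of the original patch):
+ * log_heap_update() and heap_xlog_update() pair the two functions above
+ * roughly as follows, with 'ewt' a caller-supplied buffer (the caller
+ * uses a MaxHeapTupleSize array):
+ *
+ * uint32 enclen;
+ *
+ * if (heap_delta_encode(tupdesc, oldtup, newtup, ewt, &enclen))
+ * {
+ * (the WAL record then carries only ewt/enclen; on redo the old
+ * tuple version is read from the page and the new one rebuilt)
+ * heap_delta_decode(ewt, enclen, oldtup, newtup);
+ * }
+ *
+ * If snappy cannot be initialized or compression fails, heap_delta_encode
+ * returns false and the caller logs the full new tuple instead.
+ */
+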
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 74,79 ****
--- 74,80 ----
/* GUC variable */
bool synchronize_seqscans = true;
+ extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
***************
*** 5815,5820 **** log_heap_update(Relation reln, Buffer oldbuf,
--- 5816,5827 ----
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
***************
*** 5824,5838 **** log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! xlrec.all_visible_cleared = all_visible_cleared;
xlrec.newtid = newtup->t_self;
! xlrec.new_all_visible_cleared = new_all_visible_cleared;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
--- 5831,5878 ----
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by an Update
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that over long runs most
+ * updates tend to create the new tuple version on the same page, there
+ * should not be a significant impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (newtuplen > 32)
+ && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
! if (all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
! if (new_all_visible_cleared)
! xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! if (compressed)
! xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
***************
*** 5859,5867 **** log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
! rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
! rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
--- 5899,5910 ----
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
! /*
! * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
! * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
! */
! rdata[3].data = newtupdata;
! rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
***************
*** 6671,6677 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 6714,6723 ----
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
***************
*** 6686,6692 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 6732,6738 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 6746,6752 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
--- 6792,6798 ----
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
! oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
***************
*** 6764,6770 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->all_visible_cleared)
PageClearAllVisible(page);
/*
--- 6810,6816 ----
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
! if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
***************
*** 6788,6794 **** newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->new_all_visible_cleared)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 6834,6840 ----
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 6851,6860 **** newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
--- 6897,6927 ----
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
!
! /*
! * If the record is EWT then decode it.
! */
! if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! {
! /*
! * PG93FORMAT: Header + Control byte + history reference (2-3 bytes)
! * + New data (1 byte length + variable data) + ...
! */
! oldtup.t_data = oldtupdata;
! oldtup.t_len = ItemIdGetLength(lp);
! newtup.t_data = htup;
!
! heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
! newlen = newtup.t_len;
! }
! else
! {
! /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! (char *) xlrec + hsize,
! newlen);
! }
!
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
***************
*** 6870,6876 **** newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->new_all_visible_cleared)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
--- 6937,6943 ----
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
! if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 1249,1254 **** begin:;
--- 1249,1276 ----
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+ bool
+ XLogCheckBufferNeedsBackup(Buffer buffer)
+ {
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+ }
+
+ /*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 124,129 **** extern char *default_tablespace;
--- 124,130 ----
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+ extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
***************
*** 2410,2415 **** static struct config_int ConfigureNamesInt[] =
--- 2411,2427 ----
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 147,159 **** typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
! bool new_all_visible_cleared; /* same for the page of newtid */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
/*
* This is what we need to know about vacuum page cleanup/redirect
--- 147,168 ----
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
! uint8 old_infobits_set; /* infomask bits to set on old tuple */
! uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* the old page's
! * all-visible bit
! * was cleared */
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* the new page's
! * all-visible bit
! * was cleared */
! #define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* the new tuple data
! * is delta encoded
! * (EWT) */
!
! #define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 687,692 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 687,697 ----
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+ extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 261,266 **** typedef struct CheckpointStatsData
--- 261,267 ----
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+ extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
*** a/src/test/regress/expected/update.out
--- b/src/test/regress/expected/update.out
***************
*** 97,99 **** SELECT a, b, char_length(c) FROM update_test;
--- 97,169 ----
(2 rows)
DROP TABLE update_test;
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+ DROP TABLE IF EXISTS update_test;
+ NOTICE: table "update_test" does not exist, skipping
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+ SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+ ------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+ (1 row)
+
+ DROP TABLE update_test;
*** a/src/test/regress/sql/update.sql
--- b/src/test/regress/sql/update.sql
***************
*** 59,61 **** UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
--- 59,128 ----
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+ --
+ -- Test to update continuous and non-continuous columns
+ --
+
+ DROP TABLE IF EXISTS update_test;
+ CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+ );
+
+ INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+ );
+
+ SELECT * from update_test;
+
+ -- update first column
+ UPDATE update_test SET bser = bser - 1 + 1;
+
+ -- update middle column
+ UPDATE update_test SET perf_f = 8.9;
+
+ -- update last column
+ UPDATE update_test SET ctime = '00:00:00.1';
+
+ -- update 3 continuous columns
+ UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+ -- update two non-continuous columns
+ UPDATE update_test SET destn = 'moved', samba = 0;
+ UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+ -- update causing some column alignment difference
+ UPDATE update_test SET ename = 'Tes';
+ UPDATE update_test SET dept = 'Test';
+
+ SELECT * from update_test;
+ DROP TABLE update_test;
pglz-with-micro-optimizations-4.patchapplication/octet-stream; name=pglz-with-micro-optimizations-4.patchDownload
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* GUC variable for EWT compression ratio */
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple from its delta-encoded WAL tuple (EWT) and the old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fe56318..24c117c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5815,6 +5817,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5824,15 +5832,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by an Update
+ * operation. Currently we do it only when both the old and new tuple
+ * versions are on the same page, because during recovery, if the page
+ * containing the old tuple is corrupt, that corruption should not cascade
+ * to other pages. Under the general assumption that over long runs most
+ * updates tend to create the new tuple version on the same page, there
+ * should not be a significant impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL, as in that case there is no saving from the reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5859,9 +5899,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6671,7 +6714,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6686,7 +6732,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6746,7 +6792,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6764,7 +6810,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6788,7 +6834,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6851,10 +6897,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2-3 bytes)
+ * + New data (1 byte length + variable data) + ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6870,7 +6937,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 07c68ad..c3a94a2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1249,6 +1249,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..35e8206 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,9 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +243,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +261,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be use to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
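+/*
+ * Usage sketch (editorial illustration, not from the original patch),
+ * mirroring how pglz_delta_encode() below drives these macros over a
+ * buffer [p, pend) with hash mask 'mask':
+ *
+ *	int32	a, b, c, d, hindex;
+ *
+ *	pglz_hash_init(p, hindex, a, b, c, d);
+ *	while (p < pend - 4)
+ *	{
+ *		pglz_hash_roll(p, hindex, a, b, c, d, mask);
+ *		(hindex is now the masked hash of the 4 bytes starting at p)
+ *		p++;
+ *	}
+ */
+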
/* ----------
* pglz_hist_add -
@@ -276,32 +308,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+} while (0)
+
/* ----------
* pglz_out_ctrl -
@@ -372,28 +421,42 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ maxlen = PGLZ_MAX_MATCH;
+ if (end - input < maxlen)
+ maxlen = end - input;
+ if (hend && (hend - hp < maxlen))
+ maxlen = hend - hp;
+
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (!hend)
+ thisoff = ip - hp;
+ else
+ thisoff = hend - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -413,7 +476,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +486,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -443,13 +506,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +534,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collisions, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
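+
+/*
+ * Worked example (editorial illustration, not from the original patch):
+ * for a 300-byte input choose_hash_size() returns 2048, so only
+ * 2048 * sizeof(int16) = 4 kB of hist_start[] has to be zeroed per call,
+ * instead of the full 8192 entries (16 kB).
+ */
+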
/* ----------
* pglz_compress -
@@ -484,7 +570,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +586,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +643,21 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +680,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +691,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +708,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +731,198 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave as if the history and the source
+ * strings were concatenated, so that matches could also refer to the
+ * new data, not just the history.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1026,107 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc++)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 1;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 22ba35f..6ff6b23 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -124,6 +124,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2410,6 +2411,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index e58eae5..386277d 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -147,13 +147,22 @@ typedef struct xl_heap_update
TransactionId old_xmax; /* xmax of the old tuple */
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
- uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 old_infobits_set; /* infomask bits to set on old tuple */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index cd01ecd..1ef550b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -687,6 +687,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8f06c1..56efcac 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.
If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to
be brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.
After trying some more to optimize pglz_delta_encode(), I found that if we
use the new data in the history as well, the compression results and CPU
utilization are much better.
In addition to the pglz micro-optimization changes, the following changes are
made in the modified patch:
1. The unmatched new data is also added to the history, so it can be
referenced later.
2. To incorporate this change in the LZ algorithm, one extra control bit is
needed to indicate whether the data comes from the old or the new tuple
(see the decode sketch below).
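To illustrate how this extra control bit is consumed on the decode side, here
is a minimal sketch. The helper name decode_one_item and its simplified
interface are invented for illustration only; the tag layout and the
history/new-data distinction follow pglz_delta_decode() in the attached patch,
and all output-buffer bounds checks are omitted for brevity:

#include <string.h>

/*
 * Decode one encoded item. 'ctrl' holds the current control bits, already
 * shifted so that bit 0 = literal(0)/match(1) and bit 1 = source of the
 * match (0 = earlier output, 1 = old-tuple history ending at 'hend').
 */
static const unsigned char *
decode_one_item(unsigned char ctrl, const unsigned char *sp,
                unsigned char **dpp, const char *hend)
{
    unsigned char *dp = *dpp;

    if (ctrl & 1)                       /* first bit: match vs. literal */
    {
        int     len = (sp[0] & 0x0f) + 3;
        int     off = ((sp[0] & 0xf0) << 4) | sp[1];

        sp += 2;
        if (len == 18)                  /* extended length byte follows */
            len += *sp++;

        if (ctrl & 2)                   /* second bit: source of the match */
        {
            /* copy from the old tuple (history buffer) */
            memcpy(dp, hend - off, len);
            dp += len;
        }
        else
        {
            /* copy from already-decoded new data; regions may overlap */
            while (len-- > 0)
            {
                *dp = dp[-off];
                dp++;
            }
        }
    }
    else
    {
        /* literal byte copied straight from the encoded stream */
        *dp++ = *sp++;
    }

    *dpp = dp;
    return sp;
}

The real decoder consumes four such items per control byte (two control bits
per item), checks dp against destend before every copy, and PANICs if the
encoded stream is not fully consumed.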
Performance Data
-----------------
Head code:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232908016 | 36.3914430141449
two short fields, one changed | 1232904040 | 36.5231261253357
two short fields, both changed | 1235215048 | 37.7455959320068
one short and one long field, no change | 1051394568 | 24.418487071991
ten tiny fields, all changed | 1395189872 | 43.2316210269928
hundred tiny fields, first 10 changed | 622156848 | 21.9155580997467
hundred tiny fields, all changed | 625962056 | 22.3296411037445
hundred tiny fields, half changed | 621901128 | 21.3881061077118
hundred tiny fields, half nulled | 557708096 | 19.4633228778839
pglz-with-micro-optimization-compress-using-newdata-1:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1235992768 | 37.3365149497986
two short fields, one changed | 1240979256 | 36.897796869278
two short fields, both changed | 1236079976 | 38.4273149967194
one short and one long field, no change | 651010944 | 20.9490079879761
ten tiny fields, all changed | 1315606864 | 42.5771369934082
hundred tiny fields, first 10 changed | 459134432 | 17.4556930065155
hundred tiny fields, all changed | 456506680 | 17.8865270614624
hundred tiny fields, half changed | 454784456 | 18.0130441188812
hundred tiny fields, half nulled | 486675784 | 18.6600229740143
Observations
---------------
1. It yielded compression in more cases (see all of the "hundred tiny fields" cases).
2. CPU utilization is also better.
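As a rough worked example from the tables above, for "hundred tiny fields,
first 10 changed" the WAL generated drops from 622156848 to 459134432 bytes
(459134432 / 622156848 ≈ 0.74, i.e. about a 26% reduction), and the duration
falls from roughly 21.9 to 17.5.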
Performance data for pgbench-related scenarios is attached in the document
(pgbench_lz_opt_compress_using_newdata.htm):
1. Better reduction in WAL.
2. A TPS increase can be observed once the record size is >= 250.
3. There is a small performance penalty for a single thread (0.04~3.45), but
even when the single-thread penalty is 3.45, the TPS improvement with 8
threads is high.
Do you think it meets the conditions you have in mind for proceeding further
with this patch?
Thanks to Hari Babu for helping with the implementation of this idea and for
collecting the performance data.
With Regards,
Amit Kapila.
Attachments:
pglz-with-micro-optimization-compress-using-newdata-1.patch (application/octet-stream)
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* GUC variable for EWT compression ratio */
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e88dd30..0997fe2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5827,6 +5829,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5836,15 +5844,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+ * We should not generate EWT when we need to back up the whole block in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5871,9 +5911,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6683,7 +6726,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6698,7 +6744,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6758,7 +6804,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6776,7 +6822,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6800,7 +6846,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6867,10 +6913,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2-3) bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6886,7 +6953,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fbc722c..b13be74 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1239,6 +1239,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 66c64c1..a7876e0 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -112,7 +112,7 @@
* of identical bytes like trailing spaces) and for bigger ones
* our 4K maximum look-back distance is too small.
*
- * The compressor creates a table for 8192 lists of positions.
+ * The compressor creates a table for lists of positions.
* For each input position (except the last 3), a hash key is
* built from the 4 next input bytes and the position remembered
* in the appropriate list. Thus, the table points to linked
@@ -120,7 +120,10 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway.
+ * back-pointers larger than that anyway. The size of the hash
+ * table depends on the size of the input - a larger table has
+ * a larger startup cost, as it needs to be initialized to zero,
+ * but reduces the number of hash collisions on long inputs.
*
* For each byte in the input, it's hash key (built from this
* byte and the next 3) is used to find the appropriate list
@@ -180,8 +183,7 @@
* Local definitions
* ----------
*/
-#define PGLZ_HISTORY_LISTS 8192 /* must be power of 2 */
-#define PGLZ_HISTORY_MASK (PGLZ_HISTORY_LISTS - 1)
+#define PGLZ_MAX_HISTORY_LISTS 8192 /* must be power of 2 */
#define PGLZ_HISTORY_SIZE 4096
#define PGLZ_MAX_MATCH 273
@@ -198,9 +200,10 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ int16 next; /* links for my hash key's list */
+ int16 prev;
+ uint32 hindex; /* my current hash key */
+ bool from_history; /* Is the hash entry from history buffer? */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -241,9 +244,11 @@ const PGLZ_Strategy *const PGLZ_strategy_always = &strategy_always_data;
* Statically allocated work arrays for history
* ----------
*/
-static PGLZ_HistEntry *hist_start[PGLZ_HISTORY_LISTS];
-static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
+static int16 hist_start[PGLZ_MAX_HISTORY_LISTS];
+static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
+/* Element 0 in hist_entries is unused, and means 'invalid'. */
+#define INVALID_ENTRY 0
/* ----------
* pglz_hist_idx -
@@ -257,12 +262,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e) ( \
+#define pglz_hist_idx(_s,_e, _mask) ( \
((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 9) ^ ((_s)[1] << 6) ^ \
- ((_s)[2] << 3) ^ (_s)[3])) & (PGLZ_HISTORY_MASK) \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -276,32 +309,49 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e)); \
- PGLZ_HistEntry **__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
- if (__myhe->prev == NULL) \
+ if (__myhe->prev == INVALID_ENTRY) \
(_hs)[__myhe->hindex] = __myhe->next; \
else \
- __myhe->prev->next = __myhe->next; \
- if (__myhe->next != NULL) \
- __myhe->next->prev = __myhe->prev; \
+ (_he)[__myhe->prev].next = __myhe->next; \
+ if (__myhe->next != INVALID_ENTRY) \
+ (_he)[__myhe->next].prev = __myhe->prev; \
} \
__myhe->next = *__myhsp; \
- __myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
- if (*__myhsp != NULL) \
- (*__myhsp)->prev = __myhe; \
- *__myhsp = __myhe; \
- if (++(_hn) >= PGLZ_HISTORY_SIZE) { \
- (_hn) = 0; \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ if (++(_hn) >= PGLZ_HISTORY_SIZE + 1) { \
+ (_hn) = 1; \
(_recycle) = true; \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex, _from_history) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = *__myhsp; \
+ __myhe->prev = INVALID_ENTRY; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ if (*__myhsp != INVALID_ENTRY) \
+ (_he)[(*__myhsp)].prev = _hn; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+ __myhe->from_history = _from_history; \
+} while (0)
/* ----------
* pglz_out_ctrl -
@@ -364,6 +414,49 @@ do { \
/* ----------
+ * pglz_out_tag_encode -
+ *
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination/history buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_tag_encode(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_from_history) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_from_history) \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_len > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_len) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal_encode -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal_encode(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 2; \
+} while (0)
+
+/* ----------
* pglz_find_match -
*
* Lookup the history table if the actual input stream matches
@@ -372,28 +465,48 @@ do { \
* ----------
*/
static inline int
-pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop)
+pglz_find_match(int16 *hstart, const char *input, const char *end,
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex, bool *from_history)
{
- PGLZ_HistEntry *hent;
+ int16 hentno;
int32 len = 0;
int32 off = 0;
+ bool history_match = false;
+
+ *from_history = false;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hent = hstart[pglz_hist_idx(input, end)];
- while (hent)
+ hentno = hstart[hindex];
+ while (hentno != INVALID_ENTRY)
{
+ PGLZ_HistEntry *hent = &hist_entries[hentno];
const char *ip = input;
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ history_match = false;
+ maxlen = PGLZ_MAX_MATCH;
+ if (hent->from_history && (hend - hp < maxlen))
+ maxlen = hend - hp;
+ else if (end - input < maxlen)
+ maxlen = end - input;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (hent->from_history)
+ {
+ history_match = true;
+ thisoff = hend - hp;
+ }
+ else
+ thisoff = ip - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -413,7 +526,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -423,7 +536,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -436,6 +549,7 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
*/
if (thislen > len)
{
+ *from_history = history_match;
len = thislen;
off = thisoff;
}
@@ -443,13 +557,13 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
/*
* Advance to the next history entry
*/
- hent = hent->next;
+ hentno = hent->next;
/*
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -471,6 +585,29 @@ pglz_find_match(PGLZ_HistEntry **hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -484,7 +621,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
{
unsigned char *bp = ((unsigned char *) dest) + sizeof(PGLZ_Header);
unsigned char *bstart = bp;
- int hist_next = 0;
+ int hist_next = 1;
bool hist_recycle = false;
const char *dp = source;
const char *dend = source + slen;
@@ -500,6 +637,8 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
int32 result_size;
int32 result_max;
int32 need_rate;
+ int hashsz;
+ int mask;
/*
* Our fallback strategy is the default.
@@ -555,17 +694,23 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Initialize the history lists to empty. We do not need to zero the
* hist_entries[] array; its entries are initialized as they are used.
*/
- memset(hist_start, 0, sizeof(hist_start));
+ memset(hist_start, 0, hashsz * sizeof(int16));
/*
* Compress the source directly into the output buffer.
*/
while (dp < dend)
{
+ uint32 hindex;
+ bool from_history;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -588,8 +733,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop))
+ &match_off, good_match, good_drop, NULL, hindex,
+ &from_history))
{
/*
* Create the tag and add history entries for all matched
@@ -598,9 +745,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -614,7 +762,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -637,6 +785,205 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen + slen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex, true);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ bool from_history;
+
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex,
+ &from_history))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag_encode(ctrlp, ctrlb, ctrl, bp, match_len, match_off, from_history);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ dp, dend, hindex, false);
+
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -740,3 +1087,124 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc += 2)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ if ((ctrl >> 1) & 1)
+ {
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT to
+ * OUTPUT. It is dangerous and platform dependent to use
+ * memcpy() here, because the copied areas could overlap
+ * extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 2;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..5bcf40b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -124,6 +124,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2410,6 +2411,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..36e7dc8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -148,12 +148,21 @@ typedef struct xl_heap_update
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates that old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates that new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates that the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 0a832e9..830349b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -689,6 +689,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index b4a75ce..032a422 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
On Friday, June 07, 2013 5:07 PM Amit Kapila wrote:
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
On 04.03.2013 06:39, Amit Kapila wrote:
On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
On 02/05/2013 11:53 PM, Amit Kapila wrote:
Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):
The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.
The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.
If you could squeeze pglz_delta_encode function to be cheap enough
that we could enable this by default, this would be pretty cool patch.
Or at least, the overhead in the cases that you get no compression
needs to be brought down, to about 2-5 % at most I think. If it can't
be done easily, I feel that this probably needs to be dropped.
After trying some more to optimize pglz_delta_encode(), I found that if
we use the new data also in the history, then the results of compression and
CPU utilization are much better.
In addition to the PGLZ micro-optimization changes, the following changes are
made in the modified patch:
1. The unmatched new data is also added to the history, so that it can be
referenced later.
2. To incorporate this change in the LZ algorithm, one extra control bit is
needed to indicate whether the data comes from the old or the new tuple
(see the sketch below).
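To make the two-control-bit layout concrete, here is a small, purely illustrative C sketch of how the decode side reads each item, following the pglz_delta_decode() loop in the attached patch; the helper name interpret_item is invented for this example and is not part of the patch.

#include <stdbool.h>

/*
 * Each encoded item consumes two bits of the current control byte
 * (the decode loop shifts with "ctrl >>= 2" after every item):
 *
 *   bit 0 = 0 : a literal byte follows; copy it to the output
 *   bit 0 = 1 : a 2-4 byte back-reference tag follows
 *   bit 1     : only meaningful for a tag; 1 = the offset counts back
 *               from the end of the old-tuple history, 0 = it counts
 *               back within the already-decoded new tuple
 */
static void
interpret_item(unsigned char ctrl, bool *is_tag, bool *from_history)
{
	*is_tag = (ctrl & 1) != 0;
	*from_history = *is_tag && ((ctrl >> 1) & 1) != 0;
}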
The patch is rebased to use the new PG LZ algorithm optimization changes
which got committed recently.
Performance Data
-----------------
Head code:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232911016 | 35.1784930229187
two short fields, one changed | 1240322016 | 35.0436308383942
two short fields, both changed | 1235318352 | 35.4989421367645
one short and one long field, no change | 1042332336 | 23.4457180500031
ten tiny fields, all changed | 1395194136 | 41.9023628234863
hundred tiny fields, first 10 changed | 626725984 | 21.2999589443207
hundred tiny fields, all changed | 621899224 | 21.6676609516144
hundred tiny fields, half changed | 623998272 | 21.2745981216431
hundred tiny fields, half nulled | 557714088 | 19.5902800559998
pglz-with-micro-optimization-compress-using-newdata-2:
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232903384 | 35.0115969181061
two short fields, one changed | 1232906960 | 34.3333759307861
two short fields, both changed | 1232903520 | 35.7665238380432
one short and one long field, no change | 649647992 | 19.4671010971069
ten tiny fields, all changed | 1314957136 | 39.9727990627289
hundred tiny fields, first 10 changed | 458684024 | 17.8197758197784
hundred tiny fields, all changed | 461028464 | 17.3083391189575
hundred tiny fields, half changed | 456528696 | 17.1769199371338
hundred tiny fields, half nulled | 480548936 | 18.81720495224
Observations
---------------
1. It yielded compression in more cases (refer to all the "hundred tiny
fields" cases).
2. CPU utilization is also better.
Performance data for pgbench related scenarios is attached in document
(pgbench_lz_opt_compress_using_newdata-2.htm)
1. Better reduction in WAL.
2. TPS increase can be observed once the record size is >= 250.
3. There is a small performance penalty for a single thread (0.36~3.23),
but when the penalty is 3.23 for a single thread, the TPS improvement
for 8 threads is high.
Please suggest how to proceed further with this patch.
Regards,
Hari babu.
Attachments:
pglz-with-micro-optimization-compress-using-newdata-2.patchapplication/octet-stream; name=pglz-with-micro-optimization-compress-using-newdata-2.patchDownload
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..acf88de 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1531f3b..ed51650 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5830,6 +5832,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5839,15 +5847,47 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5874,9 +5914,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows .........
+ * OR PG93FORMAT [If encoded]: LZ header + Encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6686,7 +6729,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6701,7 +6747,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6761,7 +6807,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6779,7 +6825,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6803,7 +6849,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6870,10 +6916,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG93FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + New data (1 byte length + variable data)+ ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6889,7 +6956,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0ce661b..306961c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1239,6 +1239,28 @@ begin:;
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but will not cause any problem because this function is used only to
+ * identify whether EWT is required for WAL update.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index ae67519..a98277e 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -200,9 +200,10 @@
*/
typedef struct PGLZ_HistEntry
{
- struct PGLZ_HistEntry *next; /* links for my hash key's list */
- struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ struct PGLZ_HistEntry *next; /* links for my hash key's list */
+ struct PGLZ_HistEntry *prev;
+ uint32 hindex; /* my current hash key */
+ bool from_history; /* Is the hash entry from history buffer? */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -265,12 +266,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
+#define pglz_hist_idx(_s,_e, _mask) ( \
+ ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be use to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a,b,c,d are local variables these macros use to store state. These macros
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -284,10 +313,9 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
if (__myhe->prev == NULL) \
@@ -299,7 +327,7 @@ do { \
} \
__myhe->next = &(_he)[*__myhsp]; \
__myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
/* If there was an existing entry in this hash slot, link */ \
/* this new entry to it. However, the 0th entry in the */ \
@@ -317,6 +345,23 @@ do { \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. Can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex, _from_history) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = &(_he)[*__myhsp]; \
+ __myhe->prev = NULL; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ (_he)[(*__myhsp)].prev = __myhe; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+ __myhe->from_history = _from_history; \
+} while (0)
/* ----------
* pglz_out_ctrl -
@@ -379,6 +424,49 @@ do { \
/* ----------
+ * pglz_out_tag_encode -
+ *
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination/history buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_tag_encode(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_from_history) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_from_history) \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_len > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_len) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal_encode -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal_encode(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 2; \
+} while (0)
+
+/* ----------
* pglz_find_match -
*
* Lookup the history table if the actual input stream matches
@@ -388,17 +476,21 @@ do { \
*/
static inline int
pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex, bool *from_history)
{
PGLZ_HistEntry *hent;
int16 hentno;
int32 len = 0;
int32 off = 0;
+ bool history_match = false;
+
+ *from_history = false;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
+ hentno = hstart[hindex];
hent = &hist_entries[hentno];
while (hent != INVALID_ENTRY_PTR)
{
@@ -406,11 +498,26 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ history_match = false;
+ maxlen = PGLZ_MAX_MATCH;
+ if (hent->from_history && (hend - hp < maxlen))
+ maxlen = hend - hp;
+ else if (end - input < maxlen)
+ maxlen = end - input;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (hent->from_history)
+ {
+ history_match = true;
+ thisoff = hend - hp;
+ }
+ else
+ thisoff = ip - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -430,7 +537,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -440,7 +547,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -453,6 +560,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
*/
if (thislen > len)
{
+ *from_history = history_match;
len = thislen;
off = thisoff;
}
@@ -466,7 +574,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
* Be happy with lesser good matches the more entries we visited. But
* no point in doing calculation if we're at end of list.
*/
- if (hent)
+ if (hentno != INVALID_ENTRY)
{
if (len >= good_match)
break;
@@ -488,6 +596,29 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For
+ * a small input, the startup cost dominates. The table size must be
+ * a power of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -574,6 +705,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Experiments suggest that these hash sizes work pretty well. A large
* hash table minimizes collision, but has a higher startup cost. For
@@ -603,6 +737,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
*/
while (dp < dend)
{
+ uint32 hindex;
+ bool from_history;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -625,8 +762,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ &match_off, good_match, good_drop, NULL, hindex,
+ &from_history))
{
/*
* Create the tag and add history entries for all matched
@@ -635,9 +774,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -651,7 +791,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -674,6 +814,205 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,b,c,d;
+ int32 hindex;
+
+ /*
+ * Tuples of length greater than PGLZ_HISTORY_SIZE are not allowed for
+ * delta encode as this is the maximum size of history offset.
+ */
+ if (hlen >= PGLZ_HISTORY_SIZE || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen + slen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a,b,c,d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave like the history and the source
+ * strings were concatenated, so that you could compress using the
+ * new data, too.
+ */
+ pglz_hash_roll(hp, hindex, a,b,c,d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex, true);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ while (dp < dend - 4)
+ {
+ bool from_history;
+
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp,hindex,a,b,c,d,mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex,
+ &from_history))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag_encode(ctrlp, ctrlb, ctrl, bp, match_len, match_off, from_history);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex,a,b,c,d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ dp, dend, hindex, false);
+
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -777,3 +1116,124 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest.
+ * To decompress, it uses history if provided.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc += 2)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * Otherwise it contains the match length minus 3 and the
+ * upper 4 bits of the offset. The next following byte
+ * contains the lower 8 bits of the offset. If the length is
+ * coded as 18, another extension tag byte tells how much
+ * longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history
+ * to OUTPUT.
+ */
+ if ((ctrl >> 1) & 1)
+ {
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT to
+ * OUTPUT. It is dangerous and platform dependent to use
+ * memcpy() here, because the copied areas could overlap
+ * extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just
+ * copy one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 2;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a76536..e2c42af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -125,6 +125,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2411,6 +2412,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..36e7dc8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -148,12 +148,21 @@ typedef struct xl_heap_update
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* Indicates as old
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* Indicates as new
+ * page's all visible
+ * bit is cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* Indicates as the
+ * update operation is
+ * delta encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 0a832e9..830349b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -689,6 +689,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 83e5832..4e6914c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuos and non continuos columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
I can't comment on further direction for the patch, but since it was marked
as Needs Review in the CF app I took a quick look at it.
It patches and compiles clean against the current Git HEAD, and 'make
check' runs successfully.
Does it need documentation for the GUC variable
'wal_update_compression_ratio'?
__________________________________________________________________________________
*Mike Blackwell | Technical Analyst, Distribution Services/Rollout
Management | RR Donnelley*
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com
On Tue, Jul 2, 2013 at 2:26 AM, Hari Babu <haribabu.kommi@huawei.com> wrote:
On 07/08/2013 02:21 PM, Mike Blackwell wrote:
I can't comment on further direction for the patch, but since it was marked
as Needs Review in the CF app I took a quick look at it.
It patches and compiles clean against the current Git HEAD, and 'make
check' runs successfully.
Does it need documentation for the GUC variable
'wal_update_compression_ratio'?
Yes.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tuesday, July 09, 2013 2:52 AM Mike Blackwell wrote:
I can't comment on further direction for the patch, but since it was marked as Needs Review in the CF app I took a quick look at it.
Thanks for looking into it.
Last time, Heikki found test scenarios where the original patch was not performing well.
He also proposed a different approach for WAL encoding and sent a modified patch which has a comparatively smaller negative performance impact, and
asked to check whether the patch can reduce the performance impact for the scenarios mentioned by him.
After that I found that with some modifications (using the new tuple data for encoding) to his approach, it eliminates the negative performance impact and
gives WAL reduction in more cases.
I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
The test case used is posted at the below link:
/messages/by-id/51366323.8070606@vmware.com
It patches and compiles clean against the current Git HEAD, and 'make check' runs successfully.
Does it need documentation for the GUC variable 'wal_update_compression_ratio'?
This variable has been added to test the patch with different compression ratios during development testing.
It has not been decided whether this variable will permanently be part of this patch, so currently there is no documentation for it.
However, if the decision is that it needs to be part of the patch, then documentation for it can be added.
With Regards,
Amit Kapila.
The only environment I have available at the moment is a virtual box.
That's probably not going to be very helpful for performance testing.
__________________________________________________________________________________
*Mike Blackwell | Technical Analyst, Distribution Services/Rollout
Management | RR Donnelley*
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com
On Mon, Jul 8, 2013 at 11:09 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
On Wednesday, July 10, 2013 6:32 AM Mike Blackwell wrote:
The only environment I have available at the moment is a virtual box. That's probably not going to be very helpful for performance testing.
It's okay. Anyway, thanks for doing the basic testing of the patch.
With Regards,
Amit Kapila.
On 7/9/13 12:09 AM, Amit Kapila wrote:
I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
The testcase used is posted at below link:
/messages/by-id/51366323.8070606@vmware.com
That seems easy enough to do here; Heikki's test script is excellent.
The latest patch Hari posted on July 2 has one hunk that doesn't apply
anymore. Inside src/backend/utils/adt/pg_lzcompress.c the patch
tries to change this code:
- if (hent)
+ if (hentno != INVALID_ENTRY)
But that line looks like this now:
if (hent != INVALID_ENTRY_PTR)
Definitions of those:
#define INVALID_ENTRY 0
#define INVALID_ENTRY_PTR (&hist_entries[INVALID_ENTRY])
I'm not sure if different error handling may be needed here now due to the
commit that changed this, or if the patch wasn't referring to the right
type of error originally.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Greg,
* Greg Smith (greg@2ndQuadrant.com) wrote:
That seems easy enough to do here, Heikki's test script is
excellent. The latest patch Hari posted on July 2 has one hunk that
doesn't apply anymore now. Inside
src/backend/utils/adt/pg_lzcompress.c the patch tries to change this
code:

- if (hent)
+ if (hentno != INVALID_ENTRY)
hentno certainly doesn't make much sense here - it's only used at the top
of the function to keep things a bit cleaner when extracting the address
into hent from hist_entries:
hentno = hstart[pglz_hist_idx(input, end, mask)];
hent = &hist_entries[hentno];
Indeed, as the referenced conditional is inside the following loop:
while (hent != INVALID_ENTRY_PTR)
and, since hentno == 0 implies hent == INVALID_ENTRY_PTR, the
conditional would never fail (which is what was happening prior to
Heikki committing the fix for this, changing the conditional to what is
below).
But that line looks like this now:
if (hent != INVALID_ENTRY_PTR)
Right, this is correct - it's useful to check the new value of hent
after it's been updated by:
hent = hent->next;
and see if it's possible to drop out early.
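To make that concrete, here is a small, compilable toy sketch of the same list walk; the type and function names (HistEntry, walk_bucket) and the fixed array size are invented for the example and are not the real pglz code.

#include <stddef.h>

/* Simplified stand-in for the real pglz history entry. */
typedef struct HistEntry
{
	struct HistEntry *next;
	const char *pos;
} HistEntry;

static HistEntry hist_entries[4096 + 1];

#define INVALID_ENTRY		0
#define INVALID_ENTRY_PTR	(&hist_entries[INVALID_ENTRY])

/*
 * Walk one hash bucket the way pglz_find_match() does.  hentno is used
 * only to turn the bucket number into a pointer; after that the loop
 * advances hent, so the "drop out early" test must compare hent against
 * INVALID_ENTRY_PTR.  A test on hentno could never change its result
 * inside the loop, and a bare "if (hent)" never fails either, because in
 * the real code the lists are terminated by pointing back at
 * hist_entries[0] rather than at NULL.
 */
static const char *
walk_bucket(const short *hstart, int hindex)
{
	short		hentno = hstart[hindex];
	HistEntry  *hent = &hist_entries[hentno];
	const char *best = NULL;

	while (hent != INVALID_ENTRY_PTR)
	{
		best = hent->pos;	/* stand-in for the real match comparison */

		hent = hent->next;
		if (hent != INVALID_ENTRY_PTR)
		{
			/* real code may break out here once the match is good enough */
		}
	}
	return best;
}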
I'm not sure if different error handling may be needed here now due to
the commit that changed this, or if the patch wasn't referring to
the right type of error originally.
I've not looked at anything regarding this beyond this email, but I'm
pretty confident that the change Heikki committed was the correct one.
Thanks,
Stephen
On Friday, July 19, 2013 4:11 AM Greg Smith wrote:
On 7/9/13 12:09 AM, Amit Kapila wrote:
I think the first thing to verify is whether the results posted can be validated in some other environment setup by another person.
The testcase used is posted at below link:
/messages/by-id/51366323.8070606@vmware.com
That seems easy enough to do here, Heikki's test script is excellent.
The latest patch Hari posted on July 2 has one hunk that doesn't apply
anymore now.
The HEAD code change from Heikki is correct.
During the patch rebase to the latest PGLZ optimization code, the above code change was missed.
Apart from that, some more changes are done in the patch; those are:
1. Corrected some comments in the code.
2. Added a validity check, as the source and history length combined cannot be more than or equal to 8192 (a rough sketch of the idea follows).
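For illustration only, here is a hedged guess at the shape of that new check; the function name delta_encode_length_ok is invented, and the 8192 limit is taken from the description above (it also matches the largest table returned by choose_hash_size() in the patch). The exact test is in the attached v3 patch.

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch, not the patch itself: refuse delta encoding when
 * the combined source and history length reaches 8192 bytes.
 */
static bool
delta_encode_length_ok(int32_t slen, int32_t hlen)
{
	return (hlen + slen) < 8192;
}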
Thanks for the review; please find the latest patch attached to this mail.
Regards,
Hari babu.
Attachments:
pglz-with-micro-optimization-compress-using-newdata-3.patchapplication/octet-stream; name=pglz-with-micro-optimization-compress-using-newdata-3.patchDownload
diff --git a/src/backend/access/common/heaptuple.c b/src/backend/access/common/heaptuple.c
index e39b977..875434d 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -60,7 +60,11 @@
#include "access/sysattr.h"
#include "access/tuptoaster.h"
#include "executor/tuptable.h"
+#include "utils/datum.h"
+#include "utils/pg_lzcompress.h"
+/* guc variable for EWT compression ratio*/
+int wal_update_compression_ratio = 25;
/* Does att's datatype allow packing into the 1-byte-header varlena format? */
#define ATT_IS_PACKABLE(att) \
@@ -617,6 +621,49 @@ heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
memcpy((char *) dest->t_data, (char *) src->t_data, src->t_len);
}
+/* ----------------
+ * heap_delta_encode
+ *
+ * Calculate the delta between two tuples, using pglz. The result is
+ * stored in *encdata. *encdata must point to a PGLZ_header buffer, with at
+ * least PGLZ_MAX_OUTPUT(newtup->t_len) bytes.
+ * ----------------
+ */
+bool
+heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup, HeapTuple newtup,
+ char *encdata, uint32 *enclen)
+{
+ PGLZ_Strategy strategy;
+
+ strategy = *PGLZ_strategy_default;
+ strategy.min_comp_rate = wal_update_compression_ratio;
+
+ return pglz_delta_encode(
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ newtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits),
+ encdata, enclen, &strategy
+ );
+}
+
+/* ----------------
+ * heap_delta_decode
+ *
+ * Decode a tuple using delta-encoded WAL tuple and old tuple version.
+ * ----------------
+ */
+void
+heap_delta_decode(char *encdata, uint32 enclen, HeapTuple oldtup, HeapTuple newtup)
+{
+ return pglz_delta_decode(encdata, enclen,
+ (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ MaxHeapTupleSize - offsetof(HeapTupleHeaderData, t_bits),
+ &newtup->t_len,
+ (char *) oldtup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ oldtup->t_len - offsetof(HeapTupleHeaderData, t_bits));
+}
+
/*
* heap_form_tuple
* construct a tuple from the given values[] and isnull[] arrays,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5bcbc92..6dc362e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -70,10 +70,12 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+#include "utils/pg_lzcompress.h"
/* GUC variable */
bool synchronize_seqscans = true;
+extern int wal_update_compression_ratio;
static HeapScanDesc heap_beginscan_internal(Relation relation,
@@ -5844,6 +5846,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
XLogRecPtr recptr;
XLogRecData rdata[4];
Page page = BufferGetPage(newbuf);
+ char *newtupdata;
+ int newtuplen;
+ bool compressed = false;
+
+ /* Structure which holds EWT */
+ char buf[MaxHeapTupleSize];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
@@ -5853,15 +5861,48 @@ log_heap_update(Relation reln, Buffer oldbuf,
else
info = XLOG_HEAP_UPDATE;
+ newtupdata = ((char *) newtup->t_data) + offsetof(HeapTupleHeaderData, t_bits);
+ newtuplen = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+
+ /*
+ * EWT can be generated for all new tuple versions created by Update
+ * operation. Currently we do it when both the old and new tuple versions
+ * are on same page, because during recovery if the page containing old
+ * tuple is corrupt, it should not cascade that corruption to other pages.
+ * Under the general assumption that for long runs most updates tend to
+ * create new tuple version on same page, there should not be significant
+ * impact on WAL reduction or performance.
+ *
+	 * We should not generate EWT when we need to back up the whole block in
+ * WAL as in that case there is no saving by reduced WAL size.
+ */
+ if (wal_update_compression_ratio != 0 && (oldbuf == newbuf) && !XLogCheckBufferNeedsBackup(newbuf))
+ {
+ uint32 enclen;
+
+ /* Delta-encode the new tuple using the old tuple */
+ if (heap_delta_encode(reln->rd_att, oldtup, newtup, buf, &enclen))
+ {
+ compressed = true;
+ newtupdata = buf;
+ newtuplen = enclen;
+ }
+ }
+
+ xlrec.flags = 0;
xlrec.target.node = reln->rd_node;
xlrec.target.tid = oldtup->t_self;
xlrec.old_xmax = HeapTupleHeaderGetRawXmax(oldtup->t_data);
xlrec.old_infobits_set = compute_infobits(oldtup->t_data->t_infomask,
oldtup->t_data->t_infomask2);
xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
- xlrec.all_visible_cleared = all_visible_cleared;
+ if (all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
xlrec.newtid = newtup->t_self;
- xlrec.new_all_visible_cleared = new_all_visible_cleared;
+ if (new_all_visible_cleared)
+ xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
+ if (compressed)
+ xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapUpdate;
@@ -5888,9 +5929,12 @@ log_heap_update(Relation reln, Buffer oldbuf,
rdata[2].buffer_std = true;
rdata[2].next = &(rdata[3]);
- /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
- rdata[3].data = (char *) newtup->t_data + offsetof(HeapTupleHeaderData, t_bits);
- rdata[3].len = newtup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ /*
+ * PG73FORMAT: write bitmap [+ padding] [+ oid] + data follows, OR,
+ * if delta encoded, PG94FORMAT: LZ header + encoded data follows
+ */
+ rdata[3].data = newtupdata;
+ rdata[3].len = newtuplen;
rdata[3].buffer = newbuf;
rdata[3].buffer_std = true;
rdata[3].next = NULL;
@@ -6700,7 +6744,10 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
+ HeapTupleData newtup;
+ HeapTupleData oldtup;
HeapTupleHeader htup;
+ HeapTupleHeader oldtupdata = NULL;
struct
{
HeapTupleHeaderData hdr;
@@ -6715,7 +6762,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
@@ -6775,7 +6822,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(PANIC, "heap_update_redo: invalid lp");
- htup = (HeapTupleHeader) PageGetItem(page, lp);
+ oldtupdata = htup = (HeapTupleHeader) PageGetItem(page, lp);
htup->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
htup->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -6793,7 +6840,7 @@ heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
/* Mark the page as a candidate for pruning */
PageSetPrunable(page, record->xl_xid);
- if (xlrec->all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
/*
@@ -6817,7 +6864,7 @@ newt:;
* The visibility map may need to be fixed even if the heap page is
* already up-to-date.
*/
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
{
Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
@@ -6884,10 +6931,31 @@ newsame:;
SizeOfHeapHeader);
htup = &tbuf.hdr;
MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
- /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
- memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
- (char *) xlrec + hsize,
- newlen);
+
+ /*
+ * If the record is EWT then decode it.
+ */
+ if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
+ {
+ /*
+ * PG94FORMAT: Header + Control byte + history reference (2 - 3)bytes
+ * + literal byte + ...
+ */
+ oldtup.t_data = oldtupdata;
+ oldtup.t_len = ItemIdGetLength(lp);
+ newtup.t_data = htup;
+
+ heap_delta_decode((char *) xlrec + hsize, newlen, &oldtup, &newtup);
+ newlen = newtup.t_len;
+ }
+ else
+ {
+ /* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ (char *) xlrec + hsize,
+ newlen);
+ }
+
newlen += offsetof(HeapTupleHeaderData, t_bits);
htup->t_infomask2 = xlhdr.t_infomask2;
htup->t_infomask = xlhdr.t_infomask;
@@ -6903,7 +6971,7 @@ newsame:;
if (offnum == InvalidOffsetNumber)
elog(PANIC, "heap_update_redo: failed to add tuple");
- if (xlrec->new_all_visible_cleared)
+ if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 96aceb9..fed305d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2308,6 +2308,28 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
}
/*
+ * Determine whether the buffer referenced has to be backed up. Since we don't
+ * yet have the insert lock, fullPageWrites and forcePageWrites could change
+ * later, but that will not cause any problem because this function is used
+ * only to decide whether to generate an EWT for a WAL update record.
+ */
+bool
+XLogCheckBufferNeedsBackup(Buffer buffer)
+{
+ bool doPageWrites;
+ Page page;
+
+ page = BufferGetPage(buffer);
+
+ doPageWrites = XLogCtl->Insert.fullPageWrites || XLogCtl->Insert.forcePageWrites;
+
+ if (doPageWrites && PageGetLSN(page) <= RedoRecPtr)
+ return true; /* buffer requires backup */
+
+ return false; /* buffer does not need to be backed up */
+}
+
+/*
* Determine whether the buffer referenced by an XLogRecData item has to
* be backed up, and if so fill a BkpBlock struct for it. In any case
* save the buffer's LSN at *lsn.
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 1c129b8..cbf6064 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -120,7 +120,7 @@
* matching strings. This is done on the fly while the input
* is compressed into the output area. Table entries are only
* kept for the last 4096 input positions, since we cannot use
- * back-pointers larger than that anyway. The size of the hash
+ * back-pointers larger than that anyway. The size of the hash
* table is chosen based on the size of the input - a larger table
* has a larger startup cost, as it needs to be initialized to
* zero, but reduces the number of hash collisions on long inputs.
@@ -202,7 +202,8 @@ typedef struct PGLZ_HistEntry
{
struct PGLZ_HistEntry *next; /* links for my hash key's list */
struct PGLZ_HistEntry *prev;
- int hindex; /* my current hash key */
+ uint32 hindex; /* my current hash key */
+ bool from_history; /* Is the hash entry from history buffer? */
const char *pos; /* my input position */
} PGLZ_HistEntry;
@@ -265,12 +266,40 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* hash keys more.
* ----------
*/
-#define pglz_hist_idx(_s,_e, _mask) ( \
- ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
- (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
- ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
+#define pglz_hist_idx(_s,_e, _mask) ( \
+ ((((_e) - (_s)) < 4) ? (int) (_s)[0] : \
+ (((_s)[0] << 6) ^ ((_s)[1] << 4) ^ \
+ ((_s)[2] << 2) ^ (_s)[3])) & (_mask) \
)
+/*
+ * pglz_hash_init and pglz_hash_roll can be used to calculate the hash in
+ * a rolling fashion. First, call pglz_hash_init, with a pointer to the first
+ * byte. Then call pglz_hash_roll for every subsequent byte. After each
+ * pglz_hash_roll() call, hindex holds the (masked) hash of the current byte.
+ *
+ * a, b, c and d are local variables that these macros use to store state. They
+ * don't check for end-of-buffer like pglz_hist_idx() does, so these cannot be
+ * used on the last 3 bytes of input.
+ */
+#define pglz_hash_init(_p,hindex,a,b,c,d) \
+ do { \
+ a = 0; \
+ b = _p[0]; \
+ c = _p[1]; \
+ d = _p[2]; \
+ hindex = (b << 4) ^ (c << 2) ^ d; \
+ } while (0)
+
+#define pglz_hash_roll(_p,hindex,a,b,c,d,_mask) \
+ do { \
+ /* subtract old a */ \
+ hindex ^= a; \
+ /* shift and add byte */ \
+ a = b; b = c; c = d; d = _p[3]; \
+ hindex = ((hindex << 2) ^ d) & (_mask); \
+ } while (0)
+
/* ----------
* pglz_hist_add -
@@ -284,10 +313,9 @@ static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];
* _hn and _recycle are modified in the macro.
* ----------
*/
-#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _mask) \
+#define pglz_hist_add(_hs,_he,_hn,_recycle,_s,_e, _hindex) \
do { \
- int __hindex = pglz_hist_idx((_s),(_e), (_mask)); \
- int16 *__myhsp = &(_hs)[__hindex]; \
+ int16 *__myhsp = &(_hs)[_hindex]; \
PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
if (_recycle) { \
if (__myhe->prev == NULL) \
@@ -299,7 +327,7 @@ do { \
} \
__myhe->next = &(_he)[*__myhsp]; \
__myhe->prev = NULL; \
- __myhe->hindex = __hindex; \
+ __myhe->hindex = _hindex; \
__myhe->pos = (_s); \
/* If there was an existing entry in this hash slot, link */ \
/* this new entry to it. However, the 0th entry in the */ \
@@ -317,6 +345,23 @@ do { \
} \
} while (0)
+/*
+ * A version of pglz_hist_add() that doesn't do recycling. It can be used if
+ * you know the input fits in PGLZ_HISTORY_SIZE.
+ */
+#define pglz_hist_add_no_recycle(_hs,_he,_hn,_s,_e, _hindex, _from_history) \
+do { \
+ int16 *__myhsp = &(_hs)[_hindex]; \
+ PGLZ_HistEntry *__myhe = &(_he)[_hn]; \
+ __myhe->next = &(_he)[*__myhsp]; \
+ __myhe->prev = NULL; \
+ __myhe->hindex = _hindex; \
+ __myhe->pos = (_s); \
+ (_he)[(*__myhsp)].prev = __myhe; \
+ *__myhsp = _hn; \
+ ++(_hn); \
+ __myhe->from_history = _from_history; \
+} while (0)
/* ----------
* pglz_out_ctrl -
@@ -379,6 +424,49 @@ do { \
/* ----------
+ * pglz_out_tag_encode -
+ *
+ * Outputs a backward reference tag of 2-4 bytes (depending on
+ * offset and length) to the destination/history buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_tag_encode(_ctrlp,_ctrlb,_ctrl,_buf,_len,_off,_from_history) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_from_history) \
+ _ctrlb |= _ctrl; \
+ _ctrl <<= 1; \
+ if (_len > 17) \
+ { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | 0x0f); \
+ (_buf)[1] = (unsigned char)(((_off) & 0xff)); \
+ (_buf)[2] = (unsigned char)((_len) - 18); \
+ (_buf) += 3; \
+ } else { \
+ (_buf)[0] = (unsigned char)((((_off) & 0xf00) >> 4) | ((_len) - 3)); \
+ (_buf)[1] = (unsigned char)((_off) & 0xff); \
+ (_buf) += 2; \
+ } \
+} while (0)
+
+/* ----------
+ * pglz_out_literal_encode -
+ *
+ * Outputs a literal byte to the destination buffer including the
+ * appropriate control bit.
+ * ----------
+ */
+#define pglz_out_literal_encode(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+do { \
+ pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
+ *(_buf)++ = (unsigned char)(_byte); \
+ _ctrl <<= 2; \
+} while (0)
+
+/* ----------
* pglz_find_match -
*
* Lookup the history table if the actual input stream matches
@@ -388,17 +476,21 @@ do { \
*/
static inline int
pglz_find_match(int16 *hstart, const char *input, const char *end,
- int *lenp, int *offp, int good_match, int good_drop, int mask)
+ int *lenp, int *offp, int good_match, int good_drop,
+ const char *hend, int hindex, bool *from_history)
{
PGLZ_HistEntry *hent;
int16 hentno;
int32 len = 0;
int32 off = 0;
+ bool history_match = false;
+
+ *from_history = false;
/*
* Traverse the linked history list until a good enough match is found.
*/
- hentno = hstart[pglz_hist_idx(input, end, mask)];
+ hentno = hstart[hindex];
hent = &hist_entries[hentno];
while (hent != INVALID_ENTRY_PTR)
{
@@ -406,11 +498,26 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
const char *hp = hent->pos;
int32 thisoff;
int32 thislen;
+ int32 maxlen;
+
+ history_match = false;
+ maxlen = PGLZ_MAX_MATCH;
+ if (hent->from_history && (hend - hp < maxlen))
+ maxlen = hend - hp;
+ else if (end - input < maxlen)
+ maxlen = end - input;
/*
* Stop if the offset does not fit into our tag anymore.
*/
- thisoff = ip - hp;
+ if (hent->from_history)
+ {
+ history_match = true;
+ thisoff = hend - hp;
+ }
+ else
+ thisoff = ip - hp;
+
if (thisoff >= 0x0fff)
break;
@@ -430,7 +537,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
thislen = len;
ip += len;
hp += len;
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -440,7 +547,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
}
else
{
- while (ip < end && *ip == *hp && thislen < PGLZ_MAX_MATCH)
+ while (*ip == *hp && thislen < maxlen)
{
thislen++;
ip++;
@@ -453,6 +560,7 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
*/
if (thislen > len)
{
+ *from_history = history_match;
len = thislen;
off = thisoff;
}
@@ -488,6 +596,29 @@ pglz_find_match(int16 *hstart, const char *input, const char *end,
return 0;
}
+static int
+choose_hash_size(int slen)
+{
+ int hashsz;
+
+ /*
+ * Experiments suggest that these hash sizes work pretty well. A large
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
+ */
+ if (slen < 128)
+ hashsz = 512;
+ else if (slen < 256)
+ hashsz = 1024;
+ else if (slen < 512)
+ hashsz = 2048;
+ else if (slen < 1024)
+ hashsz = 4096;
+ else
+ hashsz = 8192;
+ return hashsz;
+}
/* ----------
* pglz_compress -
@@ -574,11 +705,14 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
else
result_max = (slen * (100 - need_rate)) / 100;
+ hashsz = choose_hash_size(slen);
+ mask = hashsz - 1;
+
/*
* Experiments suggest that these hash sizes work pretty well. A large
- * hash table minimizes collision, but has a higher startup cost. For
- * a small input, the startup cost dominates. The table size must be
- * a power of two.
+ * hash table minimizes collision, but has a higher startup cost. For a
+ * small input, the startup cost dominates. The table size must be a power
+ * of two.
*/
if (slen < 128)
hashsz = 512;
@@ -603,6 +737,9 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
*/
while (dp < dend)
{
+ uint32 hindex;
+ bool from_history;
+
/*
* If we already exceeded the maximum result size, fail.
*
@@ -625,8 +762,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
/*
* Try to find a match in the history
*/
+ hindex = pglz_hist_idx(dp, dend, mask);
if (pglz_find_match(hist_start, dp, dend, &match_len,
- &match_off, good_match, good_drop, mask))
+ &match_off, good_match, good_drop, NULL, hindex,
+ &from_history))
{
/*
* Create the tag and add history entries for all matched
@@ -635,9 +774,10 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_tag(ctrlp, ctrlb, ctrl, bp, match_len, match_off);
while (match_len--)
{
+ hindex = pglz_hist_idx(dp, dend, mask);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -651,7 +791,7 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
pglz_out_literal(ctrlp, ctrlb, ctrl, bp, *dp);
pglz_hist_add(hist_start, hist_entries,
hist_next, hist_recycle,
- dp, dend, mask);
+ dp, dend, hindex);
dp++; /* Do not do this ++ in the line above! */
/* The macro would do it four times - Jan. */
}
@@ -674,6 +814,209 @@ pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
return true;
}
+/*
+ * Delta encoding.
+ *
+ * The 'source' is encoded using the same pglz algorithm used for compression.
+ * The difference with pglz_compress is that the back-references refer to
+ * the 'history', instead of earlier offsets in 'source'.
+ *
+ * The encoded result is written to *dest, and its length is returned in
+ * *finallen.
+ */
+bool
+pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen,
+ const PGLZ_Strategy *strategy)
+{
+ unsigned char *bp = ((unsigned char *) dest);
+ unsigned char *bstart = bp;
+ const char *dp = source;
+ const char *dend = source + slen;
+ const char *hp = history;
+ const char *hend = history + hlen;
+ unsigned char ctrl_dummy = 0;
+ unsigned char *ctrlp = &ctrl_dummy;
+ unsigned char ctrlb = 0;
+ unsigned char ctrl = 0;
+ bool found_match = false;
+ int32 match_len = 0;
+ int32 match_off;
+ int32 result_size;
+ int32 result_max;
+ int32 good_match;
+ int32 good_drop;
+ int32 need_rate;
+ int hist_next = 0;
+ int hashsz;
+ int mask;
+ int32 a,
+ b,
+ c,
+ d;
+ int32 hindex;
+
+ /*
+ * Delta encoding is not attempted when the combined source and history
+ * length is 2 * PGLZ_HISTORY_SIZE or more, as that is the maximum history
+ * offset that can be handled, nor when the history is shorter than 4 bytes.
+ */
+ if (((hlen + slen) >= (2 * PGLZ_HISTORY_SIZE)) || hlen < 4)
+ return false;
+
+ /*
+ * Our fallback strategy is the default.
+ */
+ if (strategy == NULL)
+ strategy = PGLZ_strategy_default;
+
+ /*
+ * If the strategy forbids compression (at all or if source chunk size out
+ * of range), fail.
+ */
+ if (strategy->match_size_good <= 0 ||
+ slen < strategy->min_input_size ||
+ slen > strategy->max_input_size)
+ return false;
+
+ need_rate = strategy->min_comp_rate;
+ if (need_rate < 0)
+ need_rate = 0;
+ else if (need_rate > 99)
+ need_rate = 99;
+
+ /*
+ * Limit the match parameters to the supported range.
+ */
+ good_match = strategy->match_size_good;
+ if (good_match > PGLZ_MAX_MATCH)
+ good_match = PGLZ_MAX_MATCH;
+ else if (good_match < 17)
+ good_match = 17;
+
+ good_drop = strategy->match_size_drop;
+ if (good_drop < 0)
+ good_drop = 0;
+ else if (good_drop > 100)
+ good_drop = 100;
+
+ /*
+ * Compute the maximum result size allowed by the strategy, namely the
+ * input size minus the minimum wanted compression rate. This had better
+ * be <= slen, else we might overrun the provided output buffer.
+ */
+ if (slen > (INT_MAX / 100))
+ {
+ /* Approximate to avoid overflow */
+ result_max = (slen / 100) * (100 - need_rate);
+ }
+ else
+ result_max = (slen * (100 - need_rate)) / 100;
+
+ hashsz = choose_hash_size(hlen + slen);
+ mask = hashsz - 1;
+
+ /*
+ * Initialize the history lists to empty. We do not need to zero the
+ * hist_entries[] array; its entries are initialized as they are used.
+ */
+ memset(hist_start, 0, hashsz * sizeof(int16));
+
+ pglz_hash_init(hp, hindex, a, b, c, d);
+ while (hp < hend - 4)
+ {
+ /*
+ * TODO: It would be nice to behave as if the history and the source
+ * strings were concatenated, so that matches could also reference the
+ * new data.
+ */
+ pglz_hash_roll(hp, hindex, a, b, c, d, mask);
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ hp, hend, hindex, true);
+ hp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Loop through the input.
+ */
+ match_off = 0;
+ pglz_hash_init(dp, hindex, a, b, c, d);
+ while (dp < dend - 4)
+ {
+ bool from_history;
+
+ /*
+ * If we already exceeded the maximum result size, fail.
+ *
+ * We check once per loop; since the loop body could emit as many as 4
+ * bytes (a control byte and 3-byte tag), PGLZ_MAX_OUTPUT() had better
+ * allow 4 slop bytes.
+ */
+ if (bp - bstart >= result_max)
+ return false;
+
+ /*
+ * Try to find a match in the history
+ */
+ pglz_hash_roll(dp, hindex, a, b, c, d, mask);
+ if (pglz_find_match(hist_start, dp, dend, &match_len,
+ &match_off, good_match, good_drop, hend, hindex,
+ &from_history))
+ {
+ /*
+ * Create the tag and add history entries for all matched
+ * characters.
+ */
+ pglz_out_tag_encode(ctrlp, ctrlb, ctrl, bp, match_len, match_off, from_history);
+ dp += match_len;
+ found_match = true;
+ pglz_hash_init(dp, hindex, a, b, c, d);
+ }
+ else
+ {
+ /*
+ * No match found. Copy one literal byte.
+ */
+ pglz_hist_add_no_recycle(hist_start, hist_entries,
+ hist_next,
+ dp, dend, hindex, false);
+
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ /* The macro would do it four times - Jan. */
+ }
+ }
+
+ if (!found_match)
+ return false;
+
+ /* Handle the last few bytes as literals */
+ while (dp < dend)
+ {
+ pglz_out_literal_encode(ctrlp, ctrlb, ctrl, bp, *dp);
+ dp++; /* Do not do this ++ in the line above! */
+ }
+
+ /*
+ * Write out the last control byte and check that we haven't overrun the
+ * output size allowed by the strategy.
+ */
+ *ctrlp = ctrlb;
+ result_size = bp - bstart;
+
+#ifdef DELTA_DEBUG
+ elog(LOG, "old %d new %d compressed %d", hlen, slen, result_size);
+#endif
+
+ /*
+ * Success - need only fill in the actual length of the compressed datum.
+ */
+ *finallen = result_size;
+
+ return true;
+}
/* ----------
* pglz_decompress -
@@ -777,3 +1120,124 @@ pglz_decompress(const PGLZ_Header *source, char *dest)
* That's it.
*/
}
+
+/* ----------
+ * pglz_delta_decode
+ *
+ * Decompresses source into dest. Backward references may point into
+ * the output produced so far or into the supplied history buffer.
+ * ----------
+ */
+void
+pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen)
+{
+ const unsigned char *sp;
+ const unsigned char *srcend;
+ unsigned char *dp;
+ unsigned char *destend;
+ const char *hend;
+
+ sp = ((const unsigned char *) source);
+ srcend = ((const unsigned char *) source) + srclen;
+ dp = (unsigned char *) dest;
+ destend = dp + destlen;
+ hend = history + histlen;
+
+ while (sp < srcend && dp < destend)
+ {
+ /*
+ * Read one control byte and process the next 8 items (or as many as
+ * remain in the compressed input).
+ */
+ unsigned char ctrl = *sp++;
+ int ctrlc;
+
+ for (ctrlc = 0; ctrlc < 8 && sp < srcend; ctrlc += 2)
+ {
+ if (ctrl & 1)
+ {
+ /*
+ * A set control bit means this item is a match tag. It holds the
+ * match length minus 3 and the upper 4 bits of the offset; the
+ * next byte contains the lower 8 bits of the offset. If the
+ * length is coded as 18, another extension tag byte tells how
+ * much longer the match really was (0-255).
+ */
+ int32 len;
+ int32 off;
+
+ len = (sp[0] & 0x0f) + 3;
+ off = ((sp[0] & 0xf0) << 4) | sp[1];
+ sp += 2;
+ if (len == 18)
+ len += *sp++;
+
+ /*
+ * Check for output buffer overrun, to ensure we don't clobber
+ * memory in case of corrupt input. Note: we must advance dp
+ * here to ensure the error is detected below the loop. We
+ * don't simply put the elog inside the loop since that will
+ * probably interfere with optimization.
+ */
+ if (dp + len > destend)
+ {
+ dp += len;
+ break;
+ }
+
+ /*
+ * Now we copy the bytes specified by the tag from history to
+ * OUTPUT.
+ */
+ if ((ctrl >> 1) & 1)
+ {
+ memcpy(dp, hend - off, len);
+ dp += len;
+ }
+ else
+ {
+ /*
+ * Now we copy the bytes specified by the tag from OUTPUT
+ * to OUTPUT. It is dangerous and platform dependent to
+ * use memcpy() here, because the copied areas could
+ * overlap extremely!
+ */
+ while (len--)
+ {
+ *dp = dp[-off];
+ dp++;
+ }
+ }
+ }
+ else
+ {
+ /*
+ * An unset control bit means LITERAL BYTE. So we just copy
+ * one from INPUT to OUTPUT.
+ */
+ if (dp >= destend) /* check for buffer overrun */
+ break; /* do not clobber memory */
+
+ *dp++ = *sp++;
+ }
+
+ /*
+ * Advance the control bit
+ */
+ ctrl >>= 2;
+ }
+ }
+
+ /*
+ * Check we decompressed the right amount.
+ */
+ if (sp != srcend)
+ elog(PANIC, "compressed data is corrupt");
+
+ /*
+ * That's it.
+ */
+ *finallen = ((char *) dp - dest);
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2b753f8..13ef553 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -125,6 +125,7 @@ extern char *default_tablespace;
extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool synchronize_seqscans;
+extern int wal_update_compression_ratio;
extern int ssl_renegotiation_limit;
extern char *SSLCipherSuites;
@@ -2437,6 +2438,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ /* Not for general use */
+ {"wal_update_compression_ratio", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("Sets the compression ratio of delta record for wal update"),
+ NULL,
+ },
+ &wal_update_compression_ratio,
+ 25, 0, 100,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..36e7dc8 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -148,12 +148,21 @@ typedef struct xl_heap_update
TransactionId new_xmax; /* xmax of the new tuple */
ItemPointerData newtid; /* new inserted tuple id */
uint8 old_infobits_set; /* infomask bits to set on old tuple */
- bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */
- bool new_all_visible_cleared; /* same for the page of newtid */
+ uint8 flags; /* flag bits, see below */
/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_update;
-#define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
+#define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED 0x01 /* old page's
+ * all-visible bit
+ * was cleared */
+#define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED 0x02 /* new page's
+ * all-visible bit
+ * was cleared */
+#define XL_HEAP_UPDATE_DELTA_ENCODED 0x04 /* new tuple data
+ * is delta
+ * encoded */
+
+#define SizeOfHeapUpdate (offsetof(xl_heap_update, flags) + sizeof(uint8))
/*
* This is what we need to know about vacuum page cleanup/redirect
diff --git a/src/include/access/htup_details.h b/src/include/access/htup_details.h
index 0a832e9..830349b 100644
--- a/src/include/access/htup_details.h
+++ b/src/include/access/htup_details.h
@@ -689,6 +689,11 @@ extern HeapTuple heap_modify_tuple(HeapTuple tuple,
extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
Datum *values, bool *isnull);
+extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ HeapTuple newtup, char *encdata, uint32 *enclen);
+extern void heap_delta_decode (char *encdata, uint32 enclen, HeapTuple oldtup,
+ HeapTuple newtup);
+
/* these three are deprecated versions of the three above: */
extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
Datum *values, char *nulls);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 002862c..0a928d9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -262,6 +262,7 @@ typedef struct CheckpointStatsData
extern CheckpointStatsData CheckpointStats;
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
+extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
extern void XLogFlush(XLogRecPtr RecPtr);
extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
diff --git a/src/include/utils/pg_lzcompress.h b/src/include/utils/pg_lzcompress.h
index 4af24a3..5add61a 100644
--- a/src/include/utils/pg_lzcompress.h
+++ b/src/include/utils/pg_lzcompress.h
@@ -107,6 +107,12 @@ extern const PGLZ_Strategy *const PGLZ_strategy_always;
*/
extern bool pglz_compress(const char *source, int32 slen, PGLZ_Header *dest,
const PGLZ_Strategy *strategy);
+extern bool pglz_delta_encode(const char *source, int32 slen,
+ const char *history, int32 hlen,
+ char *dest, uint32 *finallen, const PGLZ_Strategy *strategy);
extern void pglz_decompress(const PGLZ_Header *source, char *dest);
+extern void pglz_delta_decode(const char *source, uint32 srclen,
+ char *dest, uint32 destlen, uint32 *finallen,
+ const char *history, uint32 histlen);
#endif /* _PG_LZCOMPRESS_H_ */
diff --git a/src/test/regress/expected/update.out b/src/test/regress/expected/update.out
index 71b856f..af46df2 100644
--- a/src/test/regress/expected/update.out
+++ b/src/test/regress/expected/update.out
@@ -97,3 +97,73 @@ SELECT a, b, char_length(c) FROM update_test;
(2 rows)
DROP TABLE update_test;
+--
+-- Test to update continuous and non-continuous columns
+--
+DROP TABLE IF EXISTS update_test;
+NOTICE: table "update_test" does not exist, skipping
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+----------
+ 1 | t | Test | 7.169 | B | CSD | 01-01-2000 | 520 | road2, +| dcy2 | M | 12000 | 50.4 | 00:00:00
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+SELECT * from update_test;
+ bser | bln | ename | perf_f | grade | dept | dob | idnum | addr | destn | gend | samba | hgt | ctime
+------+-----+-------+--------+-------+-------+------------+-------+-----------------------------+--------+------+-------+------+------------
+ 1 | f | Tes | 8.9 | B | Test | 01-01-2000 | 520 | road2, +| moved | M | 0 | 10.1 | 00:00:00.1
+ | | | | | | | | streeeeet2,+| | | | |
+ | | | | | | | | city2 | | | | |
+(1 row)
+
+DROP TABLE update_test;
diff --git a/src/test/regress/sql/update.sql b/src/test/regress/sql/update.sql
index a8a028f..1806992 100644
--- a/src/test/regress/sql/update.sql
+++ b/src/test/regress/sql/update.sql
@@ -59,3 +59,70 @@ UPDATE update_test SET c = repeat('x', 10000) WHERE c = 'car';
SELECT a, b, char_length(c) FROM update_test;
DROP TABLE update_test;
+
+
+--
+-- Test to update continuous and non-continuous columns
+--
+
+DROP TABLE IF EXISTS update_test;
+CREATE TABLE update_test (
+ bser bigserial,
+ bln boolean,
+ ename VARCHAR(25),
+ perf_f float(8),
+ grade CHAR,
+ dept CHAR(5) NOT NULL,
+ dob DATE,
+ idnum INT,
+ addr VARCHAR(30) NOT NULL,
+ destn CHAR(6),
+ Gend CHAR,
+ samba BIGINT,
+ hgt float,
+ ctime TIME
+);
+
+INSERT INTO update_test VALUES (
+ nextval('update_test_bser_seq'::regclass),
+ TRUE,
+ 'Test',
+ 7.169,
+ 'B',
+ 'CSD',
+ '2000-01-01',
+ 520,
+ 'road2,
+ streeeeet2,
+ city2',
+ 'dcy2',
+ 'M',
+ 12000,
+ 50.4,
+ '00:00:00.0'
+);
+
+SELECT * from update_test;
+
+-- update first column
+UPDATE update_test SET bser = bser - 1 + 1;
+
+-- update middle column
+UPDATE update_test SET perf_f = 8.9;
+
+-- update last column
+UPDATE update_test SET ctime = '00:00:00.1';
+
+-- update 3 continuous columns
+UPDATE update_test SET destn = 'dcy2', samba = 0 WHERE Gend = 'M' and dept = 'CSD';
+
+-- update two non-continuous columns
+UPDATE update_test SET destn = 'moved', samba = 0;
+UPDATE update_test SET bln = FALSE, hgt = 10.1;
+
+-- update causing some column alignment difference
+UPDATE update_test SET ename = 'Tes';
+UPDATE update_test SET dept = 'Test';
+
+SELECT * from update_test;
+DROP TABLE update_test;
The v3 patch applies perfectly here now. Attached is a spreadsheet with
test results from two platforms, a Mac laptop and a Linux server. I
used systems with high disk speed because that seemed like a worst case
for this improvement. The actual improvement for shrinking WAL should
be even better on a system with slower disks.
There are enough problems with the "hundred tiny fields" results that I
think this is not quite ready for commit yet. More comments on that below.
This seems close though, close enough that I am planning to follow up
to see if the slow disk results are better.
Reviewing the wal-update-testsuite.sh test program, I think the only
case missing that would be useful to add is "ten tiny fields, one
changed". I think that one is interesting to highlight because it's
what benchmark programs like pgbench do very often.
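To make that concrete, here's a rough sketch of the kind of case I mean; the
table and column names are just made up for illustration, not taken from the
test suite, and pg_xlog_location_diff() is only there to read off how much WAL
the update generated:

CREATE TABLE ten_tiny (
    f1 int, f2 int, f3 int, f4 int, f5 int,
    f6 int, f7 int, f8 int, f9 int, f10 int
);
INSERT INTO ten_tiny
    SELECT g, g, g, g, g, g, g, g, g, g
    FROM generate_series(1, 10000) g;
-- Note the WAL insert location, update a single column, then note it again.
SELECT pg_current_xlog_insert_location();
UPDATE ten_tiny SET f5 = f5 + 1;
SELECT pg_current_xlog_insert_location();
-- pg_xlog_location_diff(after, before) gives the WAL bytes the UPDATE produced.
-- Repeating the UPDATE avoids measuring mostly full-page images right after a
-- checkpoint, where the delta encoding is skipped anyway.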
The GUC added by the program looks like this:
postgres=# show wal_update_compression_ratio ;
wal_update_compression_ratio
------------------------------
25
Is it possible to add a setting here that disables the feature altogether?
That always makes it easier to consider a commit, knowing people can
roll back the change if it makes performance worse. That would make
performance testing easier too. wal-update-testsuite.sh takes as long as
13 minutes, which is long enough that I'd like the easier-to-automate
comparison that disabling the GUC would allow. If that's not practical to do given the
intrusiveness of the code, it's not really necessary. I haven't looked
at the change enough to be sure how hard this is.
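For what it's worth, the wal_update_compression_ratio != 0 check in
log_heap_update suggests that setting the GUC to 0 may already act as an off
switch. If that's the intent, the comparison runs could be scripted as simply
as this untested sketch:

-- Assuming 0 disables delta encoding, per the check in log_heap_update:
SET wal_update_compression_ratio = 0;   -- feature off for this session
-- ... run the update workload, record duration and WAL volume ...
SET wal_update_compression_ratio = 25;  -- back to the patch's default
-- ... rerun the same workload and compare ...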
On the Mac, the only case that seems to have a slowdown now is "hundred
tiny fields, half nulled". It would be nice to understand just what is
going on with that one. I got some ugly results in "two short fields,
no change" too, along with a couple of other weird results, but I think
those were testing procedure issues that can be ignored. The pgbench
throttle work I did recently highlights that I can't really make a Mac
quiet/consistent for benchmarking very well. Note that I ran all of the
Mac tests with assertions on, to try and catch platform specific bugs.
The Linux ones used the default build parameters.
On Linux "hundred tiny fields, half nulled" was also by far the worst
performing one, with a >30% increase in duration despite the 14% drop in
WAL. Exactly what's going on there really needs to be investigated
before this seems safe to commit. All of the "hundred tiny fields"
cases seem pretty bad on Linux, with the rest of them running about an
11% duration increase.
This doesn't seem ready to commit for this CF, but the number of problem
cases is getting pretty small now. Now that I've gotten more familiar
with the test programs and the feature, I can run more performance tests
on this at any time really. If updates addressing the trouble cases are
ready from Amit or Hari before the next CF, send them out and I can look
at them without waiting until that one starts. This is a very promising
looking performance feature.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Attachments:
WAL-lz-v3.xlsapplication/vnd.ms-excel; name=WAL-lz-v3.xlsDownload