Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Started by Amit kapila · about 13 years ago · 15 messages
#1 Amit kapila
amit.kapila@huawei.com
2 attachment(s)

On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:

Please find the readings of the LZ patch along with the Xlog-Scale patch.
The comparison for Update operations is between
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

This contains all the consolidated data and comparison for both the approaches:

The difference between this test case and the previous one is that it uses the default value of wal_page_size (8K), whereas the previous one was configured with a wal_page_size of 1K.

pgbench_lz_wal_page_8k (LZ Compression Approach)-
The comparison for Update operations is between
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

pgbench_wal_mod_wal_page_8K (Offset Approach initially used + changes suggested by you and Noah)

base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (Offset Approach including Memcmp of tuples)

Observations From Performance Data

----------------------------------------------

1. The performance data is good with both approaches.

LZ compression - up to 100% performance improvement.

Offset Approach - up to 160% performance improvement.

2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).

3. The performance data is better for the Offset Approach with 1 thread for any size of data (it dips for the LZ compression approach).

Could you please send me your feedback on which approach should be finalized?

For LZ Compression - the patch, with fixes for the defects found, has already been uploaded to the CommitFest.

For Offset Approach - I can upload it if the decision is to use the offset-based approach.

With Regards,

Amit Kapila.

Attachments:

pgbench_lz_wal_page_8k.htm (text/html)
pgbench_wal_mod_wal_page_8k.htm (text/html)
#2 Noah Misch
noah@leadboat.com
In reply to: Amit kapila (#1)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Hi Amit,

On Tue, Oct 16, 2012 at 09:22:39AM +0000, Amit kapila wrote:

On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:

Please find the readings of the LZ patch along with the Xlog-Scale patch.
The comparison for Update operations is between
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

This contains all the consolidated data and comparison for both the approaches:

The difference between this test case and the previous one is that it uses the default value of wal_page_size (8K), whereas the previous one was configured with a wal_page_size of 1K.

What is "wal_page_size"? Is that ./configure --with-wal-blocksize?

Observations From Performance Data
----------------------------------------------
1. The performance data is good with both approaches.
LZ compression - up to 100% performance improvement.
Offset Approach - up to 160% performance improvement.
2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).
3. The performance data is better for the Offset Approach with 1 thread for any size of data (it dips for the LZ compression approach).

Stepping back a moment, I would expect this patch to change performance in at
least four ways (Heikki largely covered this upthread):

a) High-concurrency workloads will improve thanks to reduced WAL insert
contention.
b) All workloads will degrade due to the CPU cost of identifying and
implementing the optimization.
c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
d) Workloads composed primarily of long transactions with high WAL volume will
improve due to having fewer end-of-WAL-segment fsync requests.

Your benchmark numbers show small gains and losses for single-client
workloads, moving to moderate gains for 2-client workloads. This suggests
strong influence from (a), some influence from (b), and little influence from
(c) and (d). Actually, the response to scale evident in your numbers seems
too good to be true; why would (a) have such a large effect over the
transition from one client to two clients? Also, for whatever reason, all
your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
synchronous_commit=off, -F 80, and rec length 250, 8-client average
performance is only 2x that of 1-client average performance.

I attempted to reproduce this effect on an EC2 m2.4xlarge instance (8 cores,
70 GiB) with the data directory under a tmpfs mount. This should thoroughly
isolate effects (a) and (b) from (c) and (d). I used your pgbench_250.c [1] in
30s runs. Configuration:

autovacuum | off
checkpoint_segments | 500
checkpoint_timeout | 1h
client_encoding | UTF8
lc_collate | C
lc_ctype | C
max_connections | 100
server_encoding | SQL_ASCII
shared_buffers | 4GB
wal_buffers | 16MB

Benchmark results:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers. Still, I consistently get linear scaling
from 1 client to 8 clients. Why might your results have been so different in
this regard?

It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
the optimization kicking in far more frequently for the latter.

nm

[1]: http://archives.postgresql.org/message-id/001d01cda180$9f1e47a0$dd5ad6e0$@kapila@huawei.com

#3 Amit kapila
amit.kapila@huawei.com
In reply to: Noah Misch (#2)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:

Hi Amit,

Noah, Thank you for taking the performance data.

On Tue, Oct 16, 2012 at 09:22:39AM +0000, Amit kapila wrote:
On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:

Please find the readings of the LZ patch along with the Xlog-Scale patch.
The comparison for Update operations is between
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

This contains all the consolidated data and comparison for both the approaches:

The difference between this test case and the previous one is that it uses the default value of wal_page_size (8K), whereas the previous one was configured with a wal_page_size of 1K.

What is "wal_page_size"? Is that ./configure --with-wal-blocksize?

Yes.

Observations From Performance Data
----------------------------------------------
1. The performance data is good with both approaches.
LZ compression - up to 100% performance improvement.
Offset Approach - up to 160% performance improvement.
2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).
3. The performance data is better for the Offset Approach with 1 thread for any size of data (it dips for the LZ compression approach).

Stepping back a moment, I would expect this patch to change performance in at
least four ways (Heikki largely covered this upthread):

a) High-concurrency workloads will improve thanks to reduced WAL insert
contention.
b) All workloads will degrade due to the CPU cost of identifying and
implementing the optimization.
c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
d) Workloads composed primarily of long transactions with high WAL volume will
improve due to having fewer end-of-WAL-segment fsync requests.

All your points are a very good summarization of the work, but I think one more point can be added:
e) Reduced cost of computing the CRC and of copying data into the XLOG buffer in XLogInsert(), due to the reduced size of the xlog record.

Your benchmark numbers show small gains and losses for single-client
workloads, moving to moderate gains for 2-client workloads. This suggests
strong influence from (a), some influence from (b), and little influence from
(c) and (d). Actually, the response to scale evident in your numbers seems
too good to be true; why would (a) have such a large effect over the
transition from one client to two clients?

I think if we look at it just from the point of view of LZ compression, there are predominantly 2 things: your point (b) and the point (e) mentioned by me.
For a single thread, the cost of doing compression supersedes the savings in CRC and the other improvements in XLogInsert().
However, with multiple threads, the cost reduction due to point (e) reduces the time spent under the lock, and hence we see such an effect from
1 client to 2 clients.

Also, for whatever reason, all
your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
synchronous_commit=off, -F 80, and rec length 250, 8-client average
performance is only 2x that of 1-client average performance.

I am really sorry, this is my mistake in presenting the numbers; the 8-thread figure is actually from a run with -c16 -j8,
meaning 16 clients and 8 threads. That can be the reason it shows only 2X; otherwise it would have shown numbers similar to what you are seeing.

Benchmark results:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers. Still, I consistently get linear scaling
from 1 client to 8 clients. Why might your results have been so different in
this regard?

1. The only reason you see the difference in linear scalability could be that the numbers I posted for 8 threads are
from a run with -c16 -j8. I shall run with -c8 and post the performance numbers. I am hoping they will match the numbers you are seeing.
2. Now, if we look at the results you have posted,
a) there is not much performance difference between HEAD and xlog scale
b) with the LZ patch there is a decrease in performance
I think this can be because it ran for a very short time, as you have also mentioned.

It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
the optimization kicking in far more frequently for the latter.

The results, averaged over three 15-minute runs for the LZ patch, are:
-Patch- -tps@-c1- -tps@-c2- -tps@-c16-j8
xlogscale+lz,-F80 663 1232 2498
xlogscale+lz,-F100 660 1221 2361

The result shows that average tps is better with -F80, which I think is what is expected.

So to conclude, according to me, the following needs to be done:

1. To check the major discrepancy in the data about linear scaling, I shall take the data with the -c8 configuration rather than with -c16 -j8.
2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

With Regards,
Amit Kapila.

#4 Noah Misch
noah@leadboat.com
In reply to: Amit kapila (#3)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Wed, Oct 24, 2012 at 05:55:56AM +0000, Amit kapila wrote:

Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:

Stepping back a moment, I would expect this patch to change performance in at
least four ways (Heikki largely covered this upthread):

a) High-concurrency workloads will improve thanks to reduced WAL insert
contention.
b) All workloads will degrade due to the CPU cost of identifying and
implementing the optimization.
c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
d) Workloads composed primarily of long transactions with high WAL volume will
improve due to having fewer end-of-WAL-segment fsync requests.

All your points are a very good summarization of the work, but I think one more point can be added:
e) Reduced cost of computing the CRC and of copying data into the XLOG buffer in XLogInsert(), due to the reduced size of the xlog record.

True.

Your benchmark numbers show small gains and losses for single-client
workloads, moving to moderate gains for 2-client workloads. This suggests
strong influence from (a), some influence from (b), and little influence from
(c) and (d). Actually, the response to scale evident in your numbers seems
too good to be true; why would (a) have such a large effect over the
transition from one client to two clients?

I think if we look at it just from the point of view of LZ compression, there are predominantly 2 things: your point (b) and the point (e) mentioned by me.
For a single thread, the cost of doing compression supersedes the savings in CRC and the other improvements in XLogInsert().
However, with multiple threads, the cost reduction due to point (e) reduces the time spent under the lock, and hence we see such an effect from
1 client to 2 clients.

Note that the CRC calculation over variable-size data in the WAL record
happens before taking WALInsertLock.
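Roughly, the shape is as follows (a simplified sketch only, not the real
XLogInsert(), whose record assembly, header CRC, and locking are more
involved; the toy_* names are made up for illustration). Both the CRC pass
and the copy scale with record length, but only the copy sits inside the
critical section:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* stand-in for the real CRC computation over the record payload */
static uint32_t
toy_crc(uint32_t crc, const char *buf, size_t len)
{
	size_t		i;

	for (i = 0; i < len; i++)
		crc = (crc << 5) + crc + (uint8_t) buf[i];
	return crc;
}

static void
toy_xlog_insert(char *walbuf, const char *rec, size_t reclen)
{
	uint32_t	crc = toy_crc(0, rec, reclen);	/* computed outside the lock */

	(void) crc;
	/* LWLockAcquire(WALInsertLock, LW_EXCLUSIVE) would happen about here */
	memcpy(walbuf, rec, reclen);	/* copy into WAL buffers, under the lock */
	/* LWLockRelease(WALInsertLock) */
}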

Also, for whatever reason, all
your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
synchronous_commit=off, -F 80, and rec length 250, 8-client average
performance is only 2x that of 1-client average performance.

Correction: with the XLOG scale patch only, your benchmark runs show 8-client
average performance as 2x that of 1-client average performance. With both the
XLOG scale and LZ patches, it grows to almost 4x. However, both ought to be
closer to 8x.

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers. Still, I consistently get linear scaling
from 1 client to 8 clients. Why might your results have been so different in
this regard?

1. The only reason you see the difference in linear scalability could be that the numbers I posted for 8 threads are
from a run with -c16 -j8. I shall run with -c8 and post the performance numbers. I am hoping they will match the numbers you are seeing.

I doubt that. Your 2-client numbers also show scaling well below linear.
With 8 cores, 16-client performance should not fall off compared to 8 clients.

Perhaps 2 clients saturate your I/O under this workload, but 1 client does
not. Granted, that theory doesn't explain all your numbers, such as the
improvement for record length 50 @ -c1.

2. Now, if we look at the results you have posted,
a) there is not much performance difference between HEAD and xlog scale

Note that the xlog scale patch addresses a different workload:
http://archives.postgresql.org/message-id/505B3648.1040801@vmware.com

b) with the LZ patch there is a decrease in performance
I think this can be because it ran for a very short time, as you have also mentioned.

Yes, that's possible.

It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
the optimization kicking in far more frequently for the latter.

The results, averaged over three 15-minute runs for the LZ patch, are:
-Patch- -tps@-c1- -tps@-c2- -tps@-c16-j8
xlogscale+lz,-F80 663 1232 2498
xlogscale+lz,-F100 660 1221 2361

The result shows that average tps is better with -F80, which I think is what is expected.

Yes. Let me elaborate on the point I hoped to make. Based on my test above,
-F80 more than doubles the bulk WAL savings compared to -F100. Your benchmark
runs showed a 61.8% performance improvement at -F100 and a 62.5% performance
improvement at -F80. If shrinking WAL increases performance, shrinking it
more should increase performance more. Instead, you observed similar levels
of improvement at both fill factors. Why?

So to conclude, according to me, the following needs to be done:

1. To check the major discrepancy in the data about linear scaling, I shall take the data with the -c8 configuration rather than with -c16 -j8.

With unpatched HEAD, synchronous_commit=off, and sufficient I/O bandwidth, you
should be able to get pgbench to scale linearly to 8 clients. You can then
benchmark for effects (a), (b) and (e). With insufficient I/O bandwidth,
you're benchmarking (c) and (d). (And/or other effects I haven't considered.)

2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Agreed.

Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

I think the next step is to figure out what limits your scaling. Then we can
form a theory about the meaning of your benchmark numbers.

nm

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#2)
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Tue, Oct 23, 2012 at 8:21 PM, Noah Misch <noah@leadboat.com> wrote:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Ouch. I've been pretty excited by this patch, but I don't think we
want to take an "optimization" that produces a double-digit hit at 1
client and doesn't gain even at 8 clients. I'm surprised this is
costing that much, though. It doesn't seem like it should.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#2)
Re: [WIP] Performance Improvement by reducing WAL for Update Operation

On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers.

I decided to rerun those measurements with three 15-minute runs. I removed
the -F100 test and added wal_update_changes_v2.patch (delta encoding version)
to the mix. Median results:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 832 1679 6797 44 GiB
scale,-F80 830 1679 6798 44 GiB
scale+lz,-F80 736 1498 6169 11 GiB
scale+delta,-F80 841 1713 7056 10 GiB

The numbers varied little across runs. So we see the same general trends as
with the short runs; overall performance is slightly higher across the board,
and the fraction of WAL avoided is much higher. I'm suspecting the patches
shrink WAL better in these longer runs because the WAL of a short run contains
a higher density of full-page images.

From these results, I think that the LZ approach is something we could only
provide as an option; CPU-bound workloads may not be our bread and butter, but
we shouldn't dock them 10% with no option to disable. Amit's delta encoding
approach seems to be something we could safely enable across the board.

Naturally, there are other compression and delta encoding schemes. Does
anyone feel the need to explore further alternatives?

We might eventually find the need for multiple, user-selectable, WAL
compression strategies. I don't recommend taking that step yet.

nm

#7 Jesper Krogh
jesper@krogh.cc
In reply to: Noah Misch (#6)
Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation

Naturally, there are other compression and delta encoding schemes. Does
anyone feel the need to explore further alternatives?

We might eventually find the need for multiple, user-selectable, WAL
compression strategies. I don't recommend taking that step yet.

My currently implemented compression strategy is to run the WAL file through gzip in the archive command. It compresses pretty nicely and achieves 50%+ in my workload (generally closer to 70%).
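For reference, a minimal sketch of that setup in postgresql.conf (the archive directory path is only an example; adjust to taste):

# compress each completed WAL segment file while archiving it
wal_level = archive
archive_mode = on
archive_command = 'gzip < %p > /mnt/server/archivedir/%f.gz'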

On a multi-core system it will take more CPU time, but on a different core, so it does not have any effect on tps.

General compression should probably only be applied if it has a positive gain on tps.

Jesper

#8 Amit kapila
amit.kapila@huawei.com
In reply to: Noah Misch (#6)
2 attachment(s)
Re: Performance Improvement by reducing WAL for Update Operation

On Thursday, October 25, 2012 5:43 AM, Noah Misch wrote:
On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers.

I decided to rerun those measurements with three 15-minute runs. I removed
the -F100 test and added wal_update_changes_v2.patch (delta encoding version)
to the mix. Median results:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 832 1679 6797 44 GiB
scale,-F80 830 1679 6798 44 GiB
scale+lz,-F80 736 1498 6169 11 GiB
scale+delta,-F80 841 1713 7056 10 GiB

The numbers varied little across runs. So we see the same general trends as
with the short runs; overall performance is slightly higher across the board,
and the fraction of WAL avoided is much higher. I'm suspecting the patches
shrink WAL better in these longer runs because the WAL of a short run contains
a higher density of full-page images.

I have fixed all the review comments you raised on the delta encoding approach (for the needs-toast case, for now I have kept the code so that it does not go through the delta encoding; however, it can be changed).
I have also addressed the major comment about this patch from Heikki and Tom [use memcmp to find modified columns].
The patch with the review comments fixed for the delta encoding method is attached to this mail.

The readings with the modified patch are as below; the detailed configuration used is in the attached file:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8-
scale,-F80 834 1451 2701
scale+lz,-F80 659 1276 4650
scale+delta+review_fixed,-F80 873 1704 5509

The results are similar to your findings except for the 8-client result.
One doubt I have is that the machine on which I am taking data is a 4-core machine, whereas yours is 8-core.

So tomorrow I shall post the results with 1, 2, 4, and 8 clients as well.
Any further suggestions?

With Regards,
Amit Kapila.

Attachments:

wal_update_changes_v3.patch (application/octet-stream)
*** a/src/backend/access/common/heaptuple.c
--- b/src/backend/access/common/heaptuple.c
***************
*** 60,65 ****
--- 60,66 ----
  #include "access/sysattr.h"
  #include "access/tuptoaster.h"
  #include "executor/tuptable.h"
+ #include "utils/datum.h"
  
  
  /* Does att's datatype allow packing into the 1-byte-header varlena format? */
***************
*** 69,74 ****
--- 70,80 ----
  #define VARLENA_ATT_IS_PACKABLE(att) \
  	((att)->attstorage != 'p')
  
+ /* WAL Diff update options */
+ #define HEAP_UPDATE_WAL_OPT_COPY 0
+ #define HEAP_UPDATE_WAL_OPT_ADD  1
+ #define HEAP_UPDATE_WAL_OPT_IGN  2
+ #define HEAP_UPDATE_WAL_OPT_PAD  3
  
  /* ----------------------------------------------------------------
   *						misc support routines
***************
*** 618,623 **** heap_copytuple_with_tuple(HeapTuple src, HeapTuple dest)
--- 624,1140 ----
  }
  
  /*
+  * Check if the specified attribute's value is same in both given tuples.
+  * Subroutine for HeapSatisfiesHOTUpdate.
+  */
+ bool
+ heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
+ 					   HeapTuple tup1, HeapTuple tup2)
+ {
+ 	Datum		value1,
+ 				value2;
+ 	bool		isnull1,
+ 				isnull2;
+ 	Form_pg_attribute att;
+ 
+ 	/*
+ 	 * If it's a whole-tuple reference, say "not equal".  It's not really
+ 	 * worth supporting this case, since it could only succeed after a no-op
+ 	 * update, which is hardly a case worth optimizing for.
+ 	 */
+ 	if (attrnum == 0)
+ 		return false;
+ 
+ 	/*
+ 	 * Likewise, automatically say "not equal" for any system attribute other
+ 	 * than OID and tableOID; we cannot expect these to be consistent in a HOT
+ 	 * chain, or even to be set correctly yet in the new tuple.
+ 	 */
+ 	if (attrnum < 0)
+ 	{
+ 		if (attrnum != ObjectIdAttributeNumber &&
+ 			attrnum != TableOidAttributeNumber)
+ 			return false;
+ 	}
+ 
+ 	/*
+ 	 * Extract the corresponding values.  XXX this is pretty inefficient if
+ 	 * there are many indexed columns.	Should HeapSatisfiesHOTUpdate do a
+ 	 * single heap_deform_tuple call on each tuple, instead?  But that doesn't
+ 	 * work for system columns ...
+ 	 */
+ 	value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
+ 	value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
+ 
+ 	/*
+ 	 * If one value is NULL and other is not, then they are certainly not
+ 	 * equal
+ 	 */
+ 	if (isnull1 != isnull2)
+ 		return false;
+ 
+ 	/*
+ 	 * If both are NULL, they can be considered equal.
+ 	 */
+ 	if (isnull1)
+ 		return true;
+ 
+ 	/*
+ 	 * We do simple binary comparison of the two datums.  This may be overly
+ 	 * strict because there can be multiple binary representations for the
+ 	 * same logical value.	But we should be OK as long as there are no false
+ 	 * positives.  Using a type-specific equality operator is messy because
+ 	 * there could be multiple notions of equality in different operator
+ 	 * classes; furthermore, we cannot safely invoke user-defined functions
+ 	 * while holding exclusive buffer lock.
+ 	 */
+ 	if (attrnum <= 0)
+ 	{
+ 		/* The only allowed system columns are OIDs, so do this */
+ 		return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
+ 	}
+ 	else
+ 	{
+ 		Assert(attrnum <= tupdesc->natts);
+ 		att = tupdesc->attrs[attrnum - 1];
+ 		return datumIsEqual(value1, value2, att->attbyval, att->attlen);
+ 	}
+ }
+ 
+ 
+ /*
+  * get_tuple_info - Gets the tuple offset and value.
+  *
+  * calculates the attribute value and offset, where the attribute ends in the
+  * tuple based on the attribute number and previous fetched attribute info.
+  *
+  * offset (I/P and O/P variable) - Input as end of previous attribute offset
+  *		and incase if it is a first attribute then it's value is zero.
+  *		Output as end of the current attribute in the tuple.
+  * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used or not.
+  */
+ static void
+ get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
+ 			   bool hasnulls, int attnum, Datum *value, uint16 *offset,
+ 			   bool *usecacheoff)
+ {
+ 	Form_pg_attribute thisatt = att[attnum];
+ 	uint16		off = *offset;
+ 	bool		slow = *usecacheoff;
+ 	char	   *tp;
+ 	HeapTupleHeader tup = tuple->t_data;
+ 
+ 	tp = (char *) tup + tup->t_hoff;
+ 
+ 	if (hasnulls && att_isnull(attnum, bp))
+ 	{
+ 		slow = true;			/* can't use attcacheoff anymore */
+ 		*offset = off;
+ 		*usecacheoff = slow;
+ 		return;
+ 	}
+ 
+ 	if (!slow && thisatt->attcacheoff >= 0)
+ 		off = thisatt->attcacheoff;
+ 	else if (thisatt->attlen == -1)
+ 	{
+ 		/*
+ 		 * We can only cache the offset for a varlena attribute if the offset
+ 		 * is already suitably aligned, so that there would be no pad bytes in
+ 		 * any case: then the offset will be valid for either an aligned or
+ 		 * unaligned value.
+ 		 */
+ 		if (!slow &&
+ 			off == att_align_nominal(off, thisatt->attalign))
+ 			thisatt->attcacheoff = off;
+ 		else
+ 		{
+ 			off = att_align_pointer(off, thisatt->attalign, -1,
+ 									tp + off);
+ 			slow = true;
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* not varlena, so safe to use att_align_nominal */
+ 		off = att_align_nominal(off, thisatt->attalign);
+ 
+ 		if (!slow)
+ 			thisatt->attcacheoff = off;
+ 	}
+ 
+ 	*value = fetchatt(thisatt, tp + off);
+ 
+ 	off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ 
+ 	if (thisatt->attlen <= 0)
+ 		slow = true;			/* can't use attcacheoff anymore */
+ 
+ 	*offset = off;
+ 	*usecacheoff = slow;
+ }
+ 
+ 
+ /*
+  * heap_delta_encode
+  *		Forms a diff tuple from old and new tuple with the modified columns.
+  *
+  *		att - attribute list.
+  *		oldtup - pointer to the old tuple.
+  *		heaptup - pointer to the modified tuple.
+  *		wal_tup - pointer to the wal record which needs to be formed from old
+  *				  and new tuples by using the modified columns list.
+  */
+ bool
+ heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ 				  HeapTuple heaptup, HeapTuple wal_tup)
+ {
+ 	Form_pg_attribute *att = tupleDesc->attrs;
+ 	int			numberOfAttributes;
+ 	uint16		cur_offset = 0,
+ 				prev_offset = 0,
+ 				offset = 0;
+ 	int			attnum;
+ 	HeapTupleHeader newtuphdr = heaptup->t_data;
+ 	bits8	   *new_bp = newtuphdr->t_bits,
+ 			   *old_bp = oldtup->t_data->t_bits;
+ 	bool		old_hasnulls = HeapTupleHasNulls(oldtup);
+ 	bool		new_hasnulls = HeapTupleHasNulls(heaptup);
+ 	bool		cur_usecacheoff = false,
+ 				prev_usecacheoff = false;
+ 	Datum		cur_value,
+ 				prev_value;
+ 	uint16		data_length;
+ 	bool		check_for_padding = false;
+ 	char	   *data;
+ 	uint16		wal_offset = 0;
+ 
+ 	numberOfAttributes = HeapTupleHeaderGetNatts(newtuphdr);
+ 
+ 	data = (char *) wal_tup->t_data;
+ 	wal_offset = newtuphdr->t_hoff;
+ 
+ 	/* Copy the tuple header to the WAL tuple */
+ 	memcpy(data, heaptup->t_data, wal_offset);
+ 
+ 	for (attnum = 0; attnum < numberOfAttributes; attnum++)
+ 	{
+ 		/*
+ 		 * If the attribute is modified by the update operation, store the
+ 		 * appropriate offsets in the WAL record, otherwise skip to the next
+ 		 * attribute.
+ 		 */
+ 		if (!heap_tuple_attr_equals(tupleDesc, (attnum + 1), oldtup, heaptup))
+ 		{
+ 			check_for_padding = true;
+ 
+ 			/*
+ 			 * calculate the offset where the modified attribute starts in the
+ 			 * old tuple used to store in the WAL record, this will be used to
+ 			 * traverse the old tuple during recovery.
+ 			 */
+ 			if (prev_offset)
+ 			{
+ 				*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_COPY;
+ 				wal_offset += sizeof(uint8);
+ 
+ 				wal_offset = SHORTALIGN(wal_offset);
+ 
+ 				*(uint16 *) (data + wal_offset) = prev_offset;
+ 				wal_offset += sizeof(uint16);
+ 			}
+ 
+ 			/* calculate the old tuple field length which needs to ignored */
+ 			offset = prev_offset;
+ 			get_tuple_info(att, oldtup, old_bp, old_hasnulls, attnum,
+ 						   &prev_value, &prev_offset, &prev_usecacheoff);
+ 
+ 			data_length = prev_offset - offset;
+ 
+ 			if (data_length)
+ 			{
+ 				*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_IGN;
+ 				wal_offset += sizeof(uint8);
+ 
+ 				wal_offset = SHORTALIGN(wal_offset);
+ 
+ 				*(uint16 *) (data + wal_offset) = data_length;
+ 				wal_offset += sizeof(uint16);
+ 			}
+ 
+ 			/*
+ 			 * calculate the new tuple field start position to check whether
+ 			 * any padding is required or not.
+ 			 */
+ 			offset = cur_offset;
+ 			cur_offset = att_align_pointer(cur_offset,
+ 								  att[attnum]->attalign, att[attnum]->attlen,
+ 						(char *) newtuphdr + newtuphdr->t_hoff + cur_offset);
+ 
+ 			data_length = cur_offset - offset;
+ 
+ 			/*
+ 			 * The above calculation is required to identify whether any
+ 			 * alignment is required.  The padding command is added only in
+ 			 * case the data is not NULL, which is done below.
+ 			 */
+ 
+ 			offset = cur_offset;
+ 			get_tuple_info(att, heaptup, new_bp, new_hasnulls, attnum,
+ 						   &cur_value, &cur_offset, &cur_usecacheoff);
+ 
+ 			/* if the new tuple data is null then nothing is required to add */
+ 			if (new_hasnulls && att_isnull(attnum, new_bp))
+ 			{
+ 				continue;
+ 			}
+ 
+ 			/* Add the padding if required, as the data is not NULL */
+ 			if (data_length)
+ 			{
+ 				*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_PAD;
+ 				wal_offset += sizeof(uint8);
+ 
+ 				*(uint8 *) (data + wal_offset) = data_length;
+ 				wal_offset += sizeof(uint8);
+ 			}
+ 
+ 			/* get the attribute value and end offset for same */
+ 			*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_ADD;
+ 			wal_offset += sizeof(uint8);
+ 
+ 			wal_offset = SHORTALIGN(wal_offset);
+ 
+ 			data_length = cur_offset - offset;
+ 			*(uint16 *) (data + wal_offset) = data_length;
+ 			wal_offset += sizeof(uint16);
+ 
+ 			if (att[attnum]->attbyval)
+ 			{
+ 				/* pass-by-value */
+ 				char		tempdata[sizeof(Datum)];
+ 
+ 				/*
+ 				 * Here we are not storing the data as aligned in the WAL
+ 				 * record as we don't have the tuple descriptor while
+ 				 * replaying the xlog.
+ 				 *
+ 				 * But this alignment is of the data is taken care while
+ 				 * framing the tuple during heap_xlog_update.
+ 				 */
+ 				store_att_byval(tempdata,
+ 								cur_value,
+ 								att[attnum]->attlen);
+ 				memcpy((data + wal_offset), tempdata, att[attnum]->attlen);
+ 			}
+ 			else
+ 			{
+ 				memcpy((data + wal_offset),
+ 					   DatumGetPointer(cur_value),
+ 					   data_length);
+ 			}
+ 
+ 			wal_offset += data_length;
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * calculate the old tuple field start position, required to
+ 			 * ignore if any alignment is present.
+ 			 */
+ 			offset = prev_offset;
+ 			prev_offset = att_align_pointer(prev_offset,
+ 								  att[attnum]->attalign, att[attnum]->attlen,
+ 			 (char *) oldtup->t_data + oldtup->t_data->t_hoff + prev_offset);
+ 
+ 			data_length = prev_offset - offset;
+ 
+ 			/*
+ 			 * calculate the new tuple field start position to check whether
+ 			 * any padding is required or not because field alignment.
+ 			 */
+ 			offset = cur_offset;
+ 			cur_offset = att_align_pointer(cur_offset,
+ 								  att[attnum]->attalign, att[attnum]->attlen,
+ 						(char *) newtuphdr + newtuphdr->t_hoff + cur_offset);
+ 
+ 			if (data_length != (cur_offset - offset))
+ 			{
+ 				if (!check_for_padding)
+ 				{
+ 					*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_COPY;
+ 					wal_offset += sizeof(uint8);
+ 
+ 					wal_offset = SHORTALIGN(wal_offset);
+ 
+ 					*(uint16 *) (data + wal_offset) = (prev_offset - data_length);
+ 					wal_offset += sizeof(uint16);
+ 				}
+ 
+ 				if (data_length)
+ 				{
+ 					*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_IGN;
+ 					wal_offset += sizeof(uint8);
+ 
+ 					wal_offset = SHORTALIGN(wal_offset);
+ 
+ 					*(uint16 *) (data + wal_offset) = data_length;
+ 					wal_offset += sizeof(uint16);
+ 				}
+ 
+ 				data_length = cur_offset - offset;
+ 
+ 				if (data_length)
+ 				{
+ 					*(uint8 *) (data + wal_offset) = HEAP_UPDATE_WAL_OPT_PAD;
+ 					wal_offset += sizeof(uint8);
+ 
+ 					*(uint8 *) (data + wal_offset) = data_length;
+ 					wal_offset += sizeof(uint8);
+ 				}
+ 			}
+ 
+ 			get_tuple_info(att, oldtup, old_bp, old_hasnulls, attnum,
+ 						   &prev_value, &prev_offset, &prev_usecacheoff);
+ 
+ 			get_tuple_info(att, heaptup, new_bp, new_hasnulls, attnum,
+ 						   &cur_value, &cur_offset, &cur_usecacheoff);
+ 
+ 			check_for_padding = false;
+ 		}
+ 	}
+ 
+ 	if (wal_offset < heaptup->t_len)
+ 	{
+ 		wal_tup->t_len = wal_offset;
+ 		wal_tup->t_self = heaptup->t_self;
+ 		wal_tup->t_tableOid = heaptup->t_tableOid;
+ 
+ 		return true;
+ 	}
+ 
+ 	memcpy(wal_tup, heaptup, sizeof(HeapTuple));
+ 	return false;
+ }
+ 
+ /*
+  * heap_delta_decode
+  *		deforms a diff tuple and forms the new tuple with the help of old tuple.
+  *
+  * The WAL record data is in the format as below
+  *
+  *	COPY + offset until copy required
+  *	IGN + length needs to be ignored from the old tuple.
+  *	PAD + length needs to padded with zero in new tuple.
+  *	ADD + length of data + data which is modified.
+  *
+  * For the COPY command, copy the specified length from old tuple.
+  *
+  * Once the old tuple data copied, then increase the offset by the
+  * copied length.
+  *
+  * For the IGN command, ignore the specified length in the old tuple.
+  *
+  * For the PAD command, fill with zeros of the specified length in
+  * the new tuple.
+  *
+  * For the ADD command, copy the corresponding length of data from WAL
+  * record to the new tuple.
+  *
+  * Repeat this procedure until the WAL record reaches the end.
+  *
+  * Any remaining old tuple data left over is copied at the end.
+  *
+  *	htup - old tuple data pointer from which new tuple needs to be formed.
+  *	old_tup_len - old tuple length.
+  *	data - pointer to the new tuple which needs to be framed.
+  *	new_tup_len - output of new tuple data length.
+  *	waldata - wal record pointer from which the new tuple needs to formed.
+  *	wal_len - wal record length.
+  */
+ void
+ heap_delta_decode(HeapTupleHeader htup, uint32 old_tup_len, char *data,
+ 				  uint32 *new_tup_len, char *waldata, uint32 wal_len)
+ {
+ 	uint8		command;
+ 	uint16		len = 0,
+ 				data_length,
+ 				prev_offset = 0,
+ 				cur_offset = 0;
+ 	char	   *olddata = (char *) htup + htup->t_hoff;
+ 
+ 	old_tup_len -= htup->t_hoff;
+ 
+ 	/*
+ 	 * Frame the new tuple from old tuple and WAL record
+ 	 */
+ 	len = 0;
+ 
+ 	/* Frame the new tuple from the old and WAL tuples */
+ 	while (len < wal_len)
+ 	{
+ 		command = *(uint8 *) (waldata + len);
+ 		len += sizeof(uint8);
+ 
+ 		switch (command)
+ 		{
+ 			case HEAP_UPDATE_WAL_OPT_COPY:
+ 				len = SHORTALIGN(len);
+ 				data_length = *(uint16 *) (waldata + len) - prev_offset;
+ 
+ 				/* Copy the old tuple data */
+ 				memcpy((data + cur_offset),
+ 					   (olddata + prev_offset),
+ 					   data_length);
+ 				cur_offset += data_length;
+ 				prev_offset += data_length;
+ 
+ 				len += sizeof(uint16);
+ 				break;
+ 			case HEAP_UPDATE_WAL_OPT_ADD:
+ 				len = SHORTALIGN(len);
+ 				data_length = *(uint16 *) (waldata + len);
+ 				len += sizeof(uint16);
+ 
+ 				/* Copy the modified attribute data from WAL record */
+ 				memcpy((data + cur_offset), (waldata + len), data_length);
+ 				cur_offset += data_length;
+ 				len += data_length;
+ 				break;
+ 			case HEAP_UPDATE_WAL_OPT_IGN:
+ 				len = SHORTALIGN(len);
+ 				data_length = *(uint16 *) (waldata + len);
+ 
+ 				/* Skip the oldtuple with data length in the WAL record */
+ 				prev_offset += data_length;
+ 				len += sizeof(uint16);
+ 				break;
+ 			case HEAP_UPDATE_WAL_OPT_PAD:
+ 				data_length = *(uint8 *) (waldata + len);
+ 				cur_offset += data_length;
+ 				len += sizeof(uint8);
+ 				break;
+ 			default:
+ 				Assert(0);
+ 				break;
+ 		}
+ 	}
+ 
+ 	/* Copy the remaining old tuple data to the new tuple */
+ 	if (prev_offset < old_tup_len)
+ 	{
+ 		memcpy((data + cur_offset),
+ 			   (olddata + prev_offset),
+ 			   (old_tup_len - prev_offset));
+ 		cur_offset += (old_tup_len - prev_offset);
+ 	}
+ 
+ 	*new_tup_len = cur_offset;
+ }
+ 
+ 
+ /*
   * heap_form_tuple
   *		construct a tuple from the given values[] and isnull[] arrays,
   *		which are of the length indicated by tupleDescriptor->natts
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 71,77 ****
  #include "utils/syscache.h"
  #include "utils/tqual.h"
  
- 
  /* GUC variable */
  bool		synchronize_seqscans = true;
  
--- 71,76 ----
***************
*** 85,91 **** static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
  					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
  
--- 84,91 ----
  					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared,
! 				bool diff_update);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
  
***************
*** 2737,2742 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2737,2743 ----
  	ItemId		lp;
  	HeapTupleData oldtup;
  	HeapTuple	heaptup;
+ 	HeapTupleData wal_tup;
  	Page		page;
  	BlockNumber block;
  	Buffer		buffer,
***************
*** 2752,2757 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
--- 2753,2763 ----
  	bool		use_hot_update = false;
  	bool		all_visible_cleared = false;
  	bool		all_visible_cleared_new = false;
+ 	struct
+ 	{
+ 		HeapTupleHeaderData hdr;
+ 		char		data[MaxHeapTupleSize];
+ 	}			tbuf;
  
  	Assert(ItemPointerIsValid(otid));
  
***************
*** 3195,3204 **** l2:
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 											 newbuf, heaptup,
! 											 all_visible_cleared,
! 											 all_visible_cleared_new);
  
  		if (newbuf != buffer)
  		{
--- 3201,3232 ----
  	/* XLOG stuff */
  	if (RelationNeedsWAL(relation))
  	{
! 		XLogRecPtr	recptr;
! 		bool		is_delta_update = false;
! 
! 		/*
! 		 * Apply the xlog diff update algorithm only for hot updates.
! 		 */
! 		if (!need_toast && (newbuf == buffer))
! 		{
! 			wal_tup.t_data = (HeapTupleHeader) &tbuf;
! 			is_delta_update = heap_delta_encode(relation->rd_att, &oldtup,
! 												heaptup, &wal_tup);
! 
! 			recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 									 newbuf, &wal_tup,
! 									 all_visible_cleared,
! 									 all_visible_cleared_new,
! 									 is_delta_update);
! 		}
! 		else
! 		{
! 			recptr = log_heap_update(relation, buffer, oldtup.t_self,
! 									 newbuf, heaptup,
! 									 all_visible_cleared,
! 									 all_visible_cleared_new,
! 									 is_delta_update);
! 		}
  
  		if (newbuf != buffer)
  		{
***************
*** 3258,3341 **** l2:
  }
  
  /*
-  * Check if the specified attribute's value is same in both given tuples.
-  * Subroutine for HeapSatisfiesHOTUpdate.
-  */
- static bool
- heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
- 					   HeapTuple tup1, HeapTuple tup2)
- {
- 	Datum		value1,
- 				value2;
- 	bool		isnull1,
- 				isnull2;
- 	Form_pg_attribute att;
- 
- 	/*
- 	 * If it's a whole-tuple reference, say "not equal".  It's not really
- 	 * worth supporting this case, since it could only succeed after a no-op
- 	 * update, which is hardly a case worth optimizing for.
- 	 */
- 	if (attrnum == 0)
- 		return false;
- 
- 	/*
- 	 * Likewise, automatically say "not equal" for any system attribute other
- 	 * than OID and tableOID; we cannot expect these to be consistent in a HOT
- 	 * chain, or even to be set correctly yet in the new tuple.
- 	 */
- 	if (attrnum < 0)
- 	{
- 		if (attrnum != ObjectIdAttributeNumber &&
- 			attrnum != TableOidAttributeNumber)
- 			return false;
- 	}
- 
- 	/*
- 	 * Extract the corresponding values.  XXX this is pretty inefficient if
- 	 * there are many indexed columns.	Should HeapSatisfiesHOTUpdate do a
- 	 * single heap_deform_tuple call on each tuple, instead?  But that doesn't
- 	 * work for system columns ...
- 	 */
- 	value1 = heap_getattr(tup1, attrnum, tupdesc, &isnull1);
- 	value2 = heap_getattr(tup2, attrnum, tupdesc, &isnull2);
- 
- 	/*
- 	 * If one value is NULL and other is not, then they are certainly not
- 	 * equal
- 	 */
- 	if (isnull1 != isnull2)
- 		return false;
- 
- 	/*
- 	 * If both are NULL, they can be considered equal.
- 	 */
- 	if (isnull1)
- 		return true;
- 
- 	/*
- 	 * We do simple binary comparison of the two datums.  This may be overly
- 	 * strict because there can be multiple binary representations for the
- 	 * same logical value.	But we should be OK as long as there are no false
- 	 * positives.  Using a type-specific equality operator is messy because
- 	 * there could be multiple notions of equality in different operator
- 	 * classes; furthermore, we cannot safely invoke user-defined functions
- 	 * while holding exclusive buffer lock.
- 	 */
- 	if (attrnum <= 0)
- 	{
- 		/* The only allowed system columns are OIDs, so do this */
- 		return (DatumGetObjectId(value1) == DatumGetObjectId(value2));
- 	}
- 	else
- 	{
- 		Assert(attrnum <= tupdesc->natts);
- 		att = tupdesc->attrs[attrnum - 1];
- 		return datumIsEqual(value1, value2, att->attbyval, att->attlen);
- 	}
- }
- 
- /*
   * Check if the old and new tuples represent a HOT-safe update. To be able
   * to do a HOT update, we must not have changed any columns used in index
   * definitions.
--- 3286,3291 ----
***************
*** 4429,4435 **** log_heap_visible(RelFileNode rnode, BlockNumber block, Buffer vm_buffer,
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  				Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared)
  {
  	xl_heap_update xlrec;
  	xl_heap_header xlhdr;
--- 4379,4386 ----
  static XLogRecPtr
  log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  				Buffer newbuf, HeapTuple newtup,
! 				bool all_visible_cleared, bool new_all_visible_cleared,
! 				bool delta_update)
  {
  	xl_heap_update xlrec;
  	xl_heap_header xlhdr;
***************
*** 4446,4456 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
  	else
  		info = XLOG_HEAP_UPDATE;
  
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	xlrec.all_visible_cleared = all_visible_cleared;
  	xlrec.newtid = newtup->t_self;
! 	xlrec.new_all_visible_cleared = new_all_visible_cleared;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
--- 4397,4412 ----
  	else
  		info = XLOG_HEAP_UPDATE;
  
+ 	xlrec.flags = 0;
  	xlrec.target.node = reln->rd_node;
  	xlrec.target.tid = from;
! 	if (all_visible_cleared)
! 		xlrec.flags |= XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED;
  	xlrec.newtid = newtup->t_self;
! 	if (new_all_visible_cleared)
! 		xlrec.flags |= XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED;
! 	if (delta_update)
! 		xlrec.flags |= XL_HEAP_UPDATE_DELTA_ENCODED;
  
  	rdata[0].data = (char *) &xlrec;
  	rdata[0].len = SizeOfHeapUpdate;
***************
*** 5232,5237 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
--- 5188,5195 ----
  	OffsetNumber offnum;
  	ItemId		lp = NULL;
  	HeapTupleHeader htup;
+ 	HeapTupleHeader oldtup = NULL;
+ 	uint32		old_tup_len = 0;
  	struct
  	{
  		HeapTupleHeaderData hdr;
***************
*** 5246,5252 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
--- 5204,5210 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->target.tid);
***************
*** 5289,5295 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
  		elog(PANIC, "heap_update_redo: invalid lp");
  
! 	htup = (HeapTupleHeader) PageGetItem(page, lp);
  
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
--- 5247,5254 ----
  	if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
  		elog(PANIC, "heap_update_redo: invalid lp");
  
! 	oldtup = htup = (HeapTupleHeader) PageGetItem(page, lp);
! 	old_tup_len = ItemIdGetLength(lp);
  
  	htup->t_infomask &= ~(HEAP_XMAX_COMMITTED |
  						  HEAP_XMAX_INVALID |
***************
*** 5308,5314 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool hot_update)
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->all_visible_cleared)
  		PageClearAllVisible(page);
  
  	/*
--- 5267,5273 ----
  	/* Mark the page as a candidate for pruning */
  	PageSetPrunable(page, record->xl_xid);
  
! 	if (xlrec->flags & XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED)
  		PageClearAllVisible(page);
  
  	/*
***************
*** 5330,5336 **** newt:;
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->new_all_visible_cleared)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
--- 5289,5295 ----
  	 * The visibility map may need to be fixed even if the heap page is
  	 * already up-to-date.
  	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
  	{
  		Relation	reln = CreateFakeRelcacheEntry(xlrec->target.node);
  		BlockNumber block = ItemPointerGetBlockNumber(&xlrec->newtid);
***************
*** 5380,5395 **** newsame:;
  	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
  
  	newlen = record->xl_len - hsize;
! 	Assert(newlen <= MaxHeapTupleSize);
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
  	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 	/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 	memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 		   (char *) xlrec + hsize,
! 		   newlen);
  	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
--- 5339,5386 ----
  	hsize = SizeOfHeapUpdate + SizeOfHeapHeader;
  
  	newlen = record->xl_len - hsize;
! 
  	memcpy((char *) &xlhdr,
  		   (char *) xlrec + SizeOfHeapUpdate,
  		   SizeOfHeapHeader);
  	htup = &tbuf.hdr;
  	MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
! 
! 	/*
! 	 * If the new tuple was delta-encoded, decode it.
! 	 */
! 	if (xlrec->flags & XL_HEAP_UPDATE_DELTA_ENCODED)
! 	{
! 		int			bitmap_len;
! 		char	   *data = (char *) &tbuf.hdr;
! 		uint32		wal_len;
! 		char	   *waldata;
! 
! 		Assert(oldtup != NULL);
! 
! 		bitmap_len = (xlhdr.t_hoff - offsetof(HeapTupleHeaderData, t_bits));
! 		wal_len = record->xl_len - hsize - bitmap_len;
! 		waldata = (char *) xlrec + hsize + bitmap_len;
! 
! 		/* copy exactly the tuple header present in the WAL to new tuple */
! 		memcpy(data + offsetof(HeapTupleHeaderData, t_bits),
! 			   (char *) xlrec + hsize,
! 			   bitmap_len);
! 
! 		data += xlhdr.t_hoff;
! 
! 		heap_delta_decode(oldtup, old_tup_len, data, &newlen, waldata, wal_len);
! 		newlen += bitmap_len;
! 	}
! 	else
! 	{
! 		Assert(newlen <= MaxHeapTupleSize);
! 		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
! 		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
! 			   (char *) xlrec + hsize,
! 			   newlen);
! 	}
! 
  	newlen += offsetof(HeapTupleHeaderData, t_bits);
  	htup->t_infomask2 = xlhdr.t_infomask2;
  	htup->t_infomask = xlhdr.t_infomask;
***************
*** 5404,5410 **** newsame:;
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
! 	if (xlrec->new_all_visible_cleared)
  		PageClearAllVisible(page);
  
  	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
--- 5395,5401 ----
  	if (offnum == InvalidOffsetNumber)
  		elog(PANIC, "heap_update_redo: failed to add tuple");
  
! 	if (xlrec->flags & XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED)
  		PageClearAllVisible(page);
  
  	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 142,153 **** typedef struct xl_heap_update
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
! 	bool		new_all_visible_cleared;		/* same for the page of newtid */
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
! #define SizeOfHeapUpdate	(offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool))
  
  /*
   * This is what we need to know about vacuum page cleanup/redirect
--- 142,157 ----
  {
  	xl_heaptid	target;			/* deleted tuple id */
  	ItemPointerData newtid;		/* new inserted tuple id */
! 	char		flags;
! 
  	/* NEW TUPLE xl_heap_header AND TUPLE DATA FOLLOWS AT END OF STRUCT */
  } xl_heap_update;
  
! #define XL_HEAP_UPDATE_ALL_VISIBLE_CLEARED		0x01
! #define XL_HEAP_UPDATE_NEW_ALL_VISIBLE_CLEARED	0x02
! #define XL_HEAP_UPDATE_DELTA_ENCODED			0x04
! 
! #define SizeOfHeapUpdate	(offsetof(xl_heap_update, flags) + sizeof(char))
  
  /*
   * This is what we need to know about vacuum page cleanup/redirect
*** a/src/include/access/htup_details.h
--- b/src/include/access/htup_details.h
***************
*** 527,532 **** struct MinimalTupleData
--- 527,538 ----
  #define HeapTupleSetOid(tuple, oid) \
  		HeapTupleHeaderSetOid((tuple)->t_data, (oid))
  
+ /*
+  * Minimum tuple length required by the tuple during update operation for doing
+  * WAL optimization of update operation.
+  */
+ #define MinHeapTupleSizeForDeltaUpdate 64
+ 
  
  /* ----------------
   *		fastgetattr
***************
*** 636,641 **** extern HeapTuple heap_modify_tuple(HeapTuple tuple,
--- 642,654 ----
  extern void heap_deform_tuple(HeapTuple tuple, TupleDesc tupleDesc,
  				  Datum *values, bool *isnull);
  
+ extern bool heap_tuple_attr_equals(TupleDesc tupdesc, int attrnum,
+ 					   HeapTuple tup1, HeapTuple tup2);
+ extern bool heap_delta_encode(TupleDesc tupleDesc, HeapTuple oldtup,
+ 				   HeapTuple heaptup, HeapTuple wal_tup);
+ extern void heap_delta_decode(HeapTupleHeader htup, uint32 old_tup_len,
+ 				char *data, uint32 *new_tup_len, char *waldata, uint32 wal_len);
+ 
  /* these three are deprecated versions of the three above: */
  extern HeapTuple heap_formtuple(TupleDesc tupleDescriptor,
  			   Datum *values, char *nulls);
pgbench_wal_memcmp.htm (text/html)
#9 Amit Kapila
amit.kapila@huawei.com
In reply to: Amit kapila (#8)
Re: Performance Improvement by reducing WAL for Update Operation

On Thursday, October 25, 2012 11:04 PM Amit kapila wrote:

On Thursday, October 25, 2012 5:43 AM, Noah Misch wrote:
On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 816 1644 6528 1821 MiB
xlogscale,-F80 824 1643 6551 1826 MiB
xlogscale+lz,-F80 717 1466 5924 1137 MiB
xlogscale+lz,-F100 753 1508 5948 1548 MiB

Those are short runs with no averaging of multiple iterations; don't
put too much faith in the absolute numbers.

I decided to rerun those measurements with three 15-minute runs. I
removed the -F100 test and added wal_update_changes_v2.patch (delta
encoding version) to the mix. Median results:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8- -WAL@-c8-
HEAD,-F80 832 1679 6797 44 GiB
scale,-F80 830 1679 6798 44 GiB
scale+lz,-F80 736 1498 6169 11 GiB
scale+delta,-F80 841 1713 7056 10 GiB

The numbers varied little across runs. So we see the same general
trends as with the short runs; overall performance is slightly higher
across the board, and the fraction of WAL avoided is much higher. I'm
suspecting the patches shrink WAL better in these longer runs because
the WAL of a short run contains a higher density of full-page images.

I have fixed all the review comments you raised on the delta encoding
approach (for the needs-toast case, for now I have kept the code so that it
does not go through the delta encoding; however, it can be changed).
I have also addressed the major comment about this patch from Heikki and Tom
[use memcmp to find modified columns].
The patch with the review comments fixed for the delta encoding method is
attached to this mail.

The readings with the modified patch are as below; the detailed
configuration used is in the attached file:

-Patch- -tps@-c1- -tps@-c2- -tps@-c8-
scale,-F80 834 1451 2701
scale+lz,-F80 659 1276 4650
scale+delta+review_fixed,-F80 873 1704 5509

The results are similar to your findings except for the 8-client result.
One doubt I have is that the machine on which I am taking data is a 4-core
machine, whereas yours is 8-core.

So tomorrow I shall post the results with 1, 2, 4, and 8 clients as well.
Any further suggestions?

The data with 1,2,4,8 clients is as below:

-Patch- -tps@-c1- -tps@-c2- -tps@-c4- -tps@-c8-
HEAD,-F80 874 1350 2129 2733
scale,-F80 884 1393 2106 2671
scale+lz,-F80 722 1342 2646 4568
scale+delta,-F80 892 1670 3323 5481

From the above readings, it is clear that the delta encoding approach has better
performance and scaling.
Do you see the need for any further investigation from my side?

With Regards,
Amit Kapila.

#10 Noah Misch
noah@leadboat.com
In reply to: Amit kapila (#8)
Re: Performance Improvement by reducing WAL for Update Operation

On Thu, Oct 25, 2012 at 05:34:14PM +0000, Amit kapila wrote:

I have fixed all the review comments you raised on the delta encoding approach (for the needs-toast case, for now I have kept the code so that it does not go through the delta encoding; however, it can be changed).
I have also addressed the major comment about this patch from Heikki and Tom [use memcmp to find modified columns].

Could you elaborate on your reason for continuing to treat TOAST as a special
case? As best I recall, the only reason to do so before was the fact that
TOAST can change the physical representation of a column even when the
executor did not change its logical content. Since you're no longer relying on
the executor's opinion of what changed, a TOASTed value is not special.

The patch with these review comments fixed for the delta encoding method is attached with this mail.

In my previous review, I said:

Given [not relying on the executor to know which columns changed], why not
treat the tuple as an opaque series of bytes and not worry about datum
boundaries? When several narrow columns change together, say a sequence
of sixteen smallint columns, you will use fewer binary delta commands by
representing the change with a single 32-byte substitution. If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still
be able to save considerable space. That case arises in many forms:
changing one word in a long string, changing one element in a long array,
changing one field of a composite-typed column. Granted, this makes the
choice of delta encoding algorithm more important.

We may be leaving considerable savings on the table by assuming that column
boundaries are the only modified-range boundaries worth recognizing. What is
your willingness to explore general algorithms for choosing such boundaries?
Such an investigation may, of course, be a dead end.

If you conclude that finding sub-column similarity is not worthwhile, at least
teach your algorithm to aggregate runs of changing or unchanging columns into
fewer delta instructions. If a table contains twenty unchanging bool columns,
you currently use at least 80 bytes to encode that fact. By treating the run
of columns as a unit for delta encoding purposes, you could encode it in 23
bytes. The same applies for runs of changing columns.

Like Heikki, I'm left wondering why your custom delta encoding is
preferable to an encoding from the literature. Your encoding has much in
common with VCDIFF, even sharing two exact command names. If a custom
encoding is the right thing, code comments or a README section should at
least discuss the advantages over an established alternative.

Reflecting on those comments, I'm no longer too concerned about your use of a
novel format to encode the delta. For example, VCDIFF is designed to be
flexible and self-describing; we don't need that. But I would still review it
for useful ideas when designing the encoding for PostgreSQL.

Idle thought: it might pay off to use 1-byte sizes and offsets most of the
time. Tuples shorter than 256 bytes are common; for longer tuples, we can
afford wider offsets.

I still think this deserves attention. You currently need two bits to select
the delta command. Here are a few strategies to consider:

1) Store the commands in groups of eight. Each group starts with two bytes
representing the eight commands, followed by the corresponding 2-byte lengths
and ADD data blocks. The last group may end early.

2) Add extra delta commands HEAP_UPDATE_WAL_OPT_COPY16, _ADD16, _IGN16. Now
each command specifier needs three bits. Again store commands in groups of
eight, this time three bytes per group header. Expect a 1-byte length for the
original commands and a 2-byte length for _*16 commands.

3) Store each 2-bit command with a 14-bit length in a uint16. We would need
to do something grotty for --with-blocksize=32, where a 14-bit length is not
always enough.
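
For concreteness, here is a rough sketch of the third strategy (the names and
exact widths are illustrative only, not code from the patch; uint8/uint16 are
PostgreSQL's typedefs from c.h):

#define DELTA_CMD_IGN	0
#define DELTA_CMD_COPY	1
#define DELTA_CMD_ADD	2

#define DELTA_LEN_BITS	14
#define DELTA_MAX_LEN	((1 << DELTA_LEN_BITS) - 1)	/* 16383 */

/* Pack a 2-bit delta command and a 14-bit length into one uint16 header. */
static inline uint16
delta_pack(uint8 cmd, uint16 len)
{
	/* caller must ensure len <= DELTA_MAX_LEN */
	return (uint16) ((cmd << DELTA_LEN_BITS) | (len & DELTA_MAX_LEN));
}

/* Recover the command and length from a packed header. */
static inline void
delta_unpack(uint16 header, uint8 *cmd, uint16 *len)
{
	*cmd = (uint8) (header >> DELTA_LEN_BITS);
	*len = (uint16) (header & DELTA_MAX_LEN);
}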

I mentioned that heap_delta_{encode,decode} should have essentially-inverse
argument lists. That remains unchanged in this version.

Review the comments added and moved in your patch; some have become obsolete.
At least one is missing detail I requested in the previous review.

+ /*
+  * get_tuple_info - Gets the tuple offset and value.
+  *
+  * calculates the attribute value and offset, where the attribute ends in the
+  * tuple based on the attribute number and previous fetched attribute info.
+  *
+  * offset (I/P and O/P variable) - Input as end of previous attribute offset
+  *		and incase if it is a first attribute then it's value is zero.
+  *		Output as end of the current attribute in the tuple.
+  * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used or not.
+  */
+ static void
+ get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
+ 			   bool hasnulls, int attnum, Datum *value, uint16 *offset,
+ 			   bool *usecacheoff)

This name is too generic and not particularly tied to what the function does.
As far as I can see, this is a variant of heap_getattr() that also returns the
offset and has some redundant arguments. I suggest heap_getattr_with_offset()
for a name and making its argument list more like heap_getattr().
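
Something along these lines, perhaps (a declaration sketch only, assuming the
usual PostgreSQL headers; the exact parameter list is open):

/* Mirrors heap_getattr(tup, attnum, tupleDesc, isnull) and adds an in/out
 * byte offset for callers walking the tuple's attributes sequentially. */
extern Datum heap_getattr_with_offset(HeapTuple tuple, int attnum,
									  TupleDesc tupleDesc, bool *isnull,
									  uint16 *offset, bool *usecacheoff);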

+ 	if (wal_offset < heaptup->t_len)
+ 	{
+ 		wal_tup->t_len = wal_offset;
+ 		wal_tup->t_self = heaptup->t_self;
+ 		wal_tup->t_tableOid = heaptup->t_tableOid;
+ 
+ 		return true;
+ 	}

The passed-in wal_tup happens to always have MaxHeapTupleSize of storage.
However, heap_delta_encode() cannot verify that and does not document it as a
precondition. Furthermore, it makes no effort to avoid overrunning the buffer
before finally checking the accumulated length here. The encoded data could
have grown well beyond MaxHeapTupleSize.
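
One possible shape for such a guard (a minimal sketch with hypothetical names,
assuming the buffer size is threaded through the encoder; uint32 is
PostgreSQL's typedef and memcpy() comes from <string.h>):

/* Append "len" bytes to the encoded output without overrunning "maxlen". */
static bool
delta_append(char *buf, uint32 *offset, uint32 maxlen,
			 const char *data, uint32 len)
{
	if (len > maxlen - *offset)
		return false;		/* delta too large; caller falls back to the full tuple */
	memcpy(buf + *offset, data, len);
	*offset += len;
	return true;
}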

+ 
+ 	memcpy(wal_tup, heaptup, sizeof(HeapTuple));
+ 	return false;

You would need heap_copytuple_with_tuple() here, but perhaps just specify the
contents of wal_tup as undefined when heap_delta_encode() returns false.

+ }

+ #define MinHeapTupleSizeForDeltaUpdate 64

This is now unused.

Overall, there's still plenty to do before this patch is ready. I'm marking
it Returned with Feedback. I look forward to the benefits it will bring when
finished.

nm

#11Amit Kapila
amit.kapila@huawei.com
In reply to: Noah Misch (#10)
Re: Performance Improvement by reducing WAL for Update Operation

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:

On Thu, Oct 25, 2012 at 05:34:14PM +0000, Amit kapila wrote:

I have fixed all the review comments you raised on the delta encoding approach
(for the needs-toast case, I have kept the existing code for now, so it does
not go through the delta encoding; this can be changed).
I have also fixed the major comment about this patch by Heikki and Tom
[use memcmp to find modified columns].

Could you elaborate on your reason for continuing to treat TOAST as a special
case? As best I recall, the only reason to do so before was the fact that
TOAST can change the physical representation of a column even when the
executor did not change its logical content. Since you're no longer relying on
the executor's opinion of what changed, a TOASTed value is not special.

I thought that for the initial version of the patch, without this change, the
patch would have less impact and need less testing.
However, in the new version I shall take care of TOASTed values as you
suggested.

The patch with these review comments fixed for the delta encoding method is
attached with this mail.

In my previous review, I said:

Given [not relying on the executor to know which columns changed], why not
treat the tuple as an opaque series of bytes and not worry about datum
boundaries? When several narrow columns change together, say a sequence
of sixteen smallint columns, you will use fewer binary delta commands by
representing the change with a single 32-byte substitution. If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still
be able to save considerable space. That case arises in many forms:
changing one word in a long string, changing one element in a long array,
changing one field of a composite-typed column. Granted, this makes the
choice of delta encoding algorithm more important.

We may be leaving considerable savings on the table by assuming that column
boundaries are the only modified-range boundaries worth recognizing. What is
your willingness to explore general algorithms for choosing such boundaries?
Such an investigation may, of course, be a dead end.

For this patch, I would like to stick with the delta encoding approach based
on column boundaries.

However, I shall try sub-column boundaries separately, and if that gives
positive results I will share them with hackers.
I will try VCDIFF; let me know if you have any other algorithm in mind.
The other positive point of exploring a standard diff algorithm is that, if it
gives positive results, we could also try a standard compression algorithm for
full-page writes to reduce WAL further.

If you conclude that finding sub-column similarity is not worthwhile, at least
teach your algorithm to aggregate runs of changing or unchanging columns into
fewer delta instructions. If a table contains twenty unchanging bool columns,
you currently use at least 80 bytes to encode that fact. By treating the run
of columns as a unit for delta encoding purposes, you could encode it in 23
bytes.

Do you mean handling non-contiguous unchanged columns?
I believe contiguous unchanged columns are already handled, unless there are
alignment changes. Example:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
                 f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
                 f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
                 f20 bool, f21 bool);

insert into tbl values(10,
    't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t');

update tbl set f1 = 20;

The delta algorithm for the above operation reduced the size of the tuple
from 24 bytes to 12 bytes.

4 bytes - IGN command and LEN
4 bytes - ADD command and LEN
4 bytes - Data block

The same applies for runs of changing columns.

I shall handle changed columns similarly to unchanged columns.

Like Heikki, I'm left wondering why your custom delta encoding is
preferable to an encoding from the literature. Your encoding has much in
common with VCDIFF, even sharing two exact command names. If a custom
encoding is the right thing, code comments or a README section should at
least discuss the advantages over an established alternative.

Reflecting on those comments, I'm no longer too concerned about your use of a
novel format to encode the delta. For example, VCDIFF is designed to be
flexible and self-describing; we don't need that. But I would still review it
for useful ideas when designing the encoding for PostgreSQL.

Idle thought: it might pay off to use 1-byte sizes and offsets most of the
time. Tuples shorter than 256 bytes are common; for longer tuples, we can
afford wider offsets.

I still think this deserves attention. You currently need two bits to select
the delta command. Here are a few strategies to consider:

1) Store the commands in groups of eight. Each group starts with two bytes
representing the eight commands, followed by the corresponding 2-byte lengths
and ADD data blocks. The last group may end early.

2) Add extra delta commands HEAP_UPDATE_WAL_OPT_COPY16, _ADD16, _IGN16. Now
each command specifier needs three bits. Again store commands in groups of
eight, this time three bytes per group header. Expect a 1-byte length for the
original commands and a 2-byte length for _*16 commands.

3) Store each 2-bit command with a 14-bit length in a uint16. We would need
to do something grotty for --with-blocksize=32, where a 14-bit length is not
always enough.

I shall also refer to the VCDIFF format and, based on your suggestions, modify
the encoding algorithm.

I mentioned that heap_delta_{encode,decode} should have essentially-inverse
argument lists. That remains unchanged in this version.

Review the comments added and moved in your patch; some have become obsolete.
At least one is missing detail I requested in the previous review.

+ /*
+  * get_tuple_info - Gets the tuple offset and value.
+  *
+  * calculates the attribute value and offset, where the attribute ends in the
+  * tuple based on the attribute number and previous fetched attribute info.
+  *
+  * offset (I/P and O/P variable) - Input as end of previous attribute offset
+  *		and incase if it is a first attribute then it's value is zero.
+  *		Output as end of the current attribute in the tuple.
+  * usecacheoff (I/P and O/P variable) - Attribute cacheoff can be used or not.
+  */
+ static void
+ get_tuple_info(Form_pg_attribute *att, HeapTuple tuple, bits8 *bp,
+ 			   bool hasnulls, int attnum, Datum *value, uint16 *offset,
+ 			   bool *usecacheoff)

This name is too generic and not particularly tied to what the function does.
As far as I can see, this is a variant of heap_getattr() that also returns the
offset and has some redundant arguments. I suggest heap_getattr_with_offset()
for a name and making its argument list more like heap_getattr().

+ 	if (wal_offset < heaptup->t_len)
+ 	{
+ 		wal_tup->t_len = wal_offset;
+ 		wal_tup->t_self = heaptup->t_self;
+ 		wal_tup->t_tableOid = heaptup->t_tableOid;
+
+ 		return true;
+ 	}

The passed-in wal_tup happens to always have MaxHeapTupleSize of storage.
However, heap_delta_encode() cannot verify that and does not document it as a
precondition. Furthermore, it makes no effort to avoid overrunning the buffer
before finally checking the accumulated length here. The encoded data could
have grown well beyond MaxHeapTupleSize.

+
+ 	memcpy(wal_tup, heaptup, sizeof(HeapTuple));
+ 	return false;

You would need heap_copytuple_with_tuple() here, but perhaps just specify the
contents of wal_tup as undefined when heap_delta_encode() returns false.

+ }

+ #define MinHeapTupleSizeForDeltaUpdate 64

This is now unused.

I shall handle the review comments in the new version.

Overall, there's still plenty to do before this patch is ready. I'm marking
it Returned with Feedback. I look forward to the benefits it will bring when
finished.

Thank you for the review.

With Regards,
Amit Kapila.

#12Noah Misch
noah@leadboat.com
In reply to: Amit kapila (#8)
Re: Performance Improvement by reducing WAL for Update Operation

On Sat, Oct 27, 2012 at 04:57:46PM +0530, Amit Kapila wrote:

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:

Could you elaborate on your reason for continuing to treat TOAST as a special
case? As best I recall, the only reason to do so before was the fact that
TOAST can change the physical representation of a column even when the
executor did not change its logical content. Since you're no longer relying on
the executor's opinion of what changed, a TOASTed value is not special.

I thought that for the initial version of the patch, without this change, the
patch would have less impact and need less testing.

Not that I'm aware. If you still think so, please explain.

For this patch, I would like to stick with the delta encoding approach based
on column boundaries.

Fair enough.

If you conclude that finding sub-column similarity is not worthwhile, at least
teach your algorithm to aggregate runs of changing or unchanging columns into
fewer delta instructions. If a table contains twenty unchanging bool columns,
you currently use at least 80 bytes to encode that fact. By treating the run
of columns as a unit for delta encoding purposes, you could encode it in 23
bytes.

Do you mean handling non-contiguous unchanged columns?

My statement above was a mess.

I believe contiguous unchanged columns are already handled, unless there are
alignment changes. Example:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
                 f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
                 f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
                 f20 bool, f21 bool);

insert into tbl values(10,
    't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t');

update tbl set f1 = 20;

The delta algorithm for the above operation reduced the size of the tuple
from 24 bytes to 12 bytes.

4 bytes - IGN command and LEN
4 bytes - ADD command and LEN
4 bytes - Data block

I now see that this case is already handled. Sorry for the noise.
Incidentally, I tried this variant:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
f20 bool, f21 bool, f22 int, f23 int);
insert into tbl values(1,
't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t',
't', 2, 3);
update tbl set f1 = 2, f22 = 4, f23 = 6;

It yielded an erroneous delta: IGN 4, ADD 4, COPY 24, IGN 4, ADD 4, COPY 28,
IGN 4, ADD 4. (The delta happens to be longer than the data and goes unused).

nm

#13Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#11)
Re: Performance Improvement by reducing WAL for Update Operation

On 27.10.2012 14:27, Amit Kapila wrote:

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:

In my previous review, I said:

Given [not relying on the executor to know which columns changed], why not
treat the tuple as an opaque series of bytes and not worry about datum
boundaries? When several narrow columns change together, say a sequence
of sixteen smallint columns, you will use fewer binary delta commands by
representing the change with a single 32-byte substitution. If an UPDATE
changes just part of a long datum, the delta encoding algorithm will still
be able to save considerable space. That case arises in many forms:
changing one word in a long string, changing one element in a long array,
changing one field of a composite-typed column. Granted, this makes the
choice of delta encoding algorithm more important.

We may be leaving considerable savings on the table by assuming that column
boundaries are the only modified-range boundaries worth recognizing. What is
your willingness to explore general algorithms for choosing such boundaries?
Such an investigation may, of course, be a dead end.

For this patch, I would like to stick with the delta encoding approach based
on column boundaries.

However, I shall try sub-column boundaries separately, and if that gives
positive results I will share them with hackers.
I will try VCDIFF; let me know if you have any other algorithm in mind.

One idea is to use the LZ format in the WAL record, but use your
memcmp() code to construct it. I believe the slow part in LZ compression
is in trying to locate matches in the "history", so if you just replace
that with your code that's aware of the column boundaries and uses
simple memcmp() to detect what parts changed, you could create LZ
compressed output just as quickly as the custom encoded format. It would
leave the door open for making the encoding smarter or to do actual
compression in the future, without changing the format and the code to
decode it.
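
As a very rough sketch of that idea (the pglz_emit_* helpers are hypothetical
placeholders, not pg_lzcompress.c's actual API, and col_off is assumed to hold
ncols+1 attribute boundary offsets valid for both tuple versions):

static void
encode_update_as_lz(const char *oldtup, const char *newtup,
					const uint16 *col_off, int ncols, uint32 oldlen)
{
	int		i;

	for (i = 0; i < ncols; i++)
	{
		uint32	start = col_off[i];
		uint32	len = col_off[i + 1] - start;

		if (start + len <= oldlen &&
			memcmp(oldtup + start, newtup + start, len) == 0)
			pglz_emit_match(start, len);				/* unchanged: offset/length tag */
		else
			pglz_emit_literals(newtup + start, len);	/* changed: raw new bytes */
	}
}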

- Heikki

#14Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#13)
Re: Performance Improvement by reducing WAL for Update Operation

On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:

On 27.10.2012 14:27, Amit Kapila wrote:

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:

In my previous review, I said:

Given [not relying on the executor to know which columns changed], why not ...

For this patch, I would like to stick with the delta encoding approach based
on column boundaries.

However, I shall try sub-column boundaries separately, and if that gives
positive results I will share them with hackers.
I will try VCDIFF; let me know if you have any other algorithm in mind.

One idea is to use the LZ format in the WAL record, but use your
memcmp() code to construct it. I believe the slow part in LZ compression
is in trying to locate matches in the "history", so if you just replace
that with your code that's aware of the column boundaries and uses
simple memcmp() to detect what parts changed, you could create LZ
compressed output just as quickly as the custom encoded format. It would
leave the door open for making the encoding smarter or to do actual
compression in the future, without changing the format and the code to
decode it.

This is a good idea. I shall try it.

In the existing algorithm, storing new data that is not present in the history
needs one control byte for every 8 bytes of new data, which can increase the
size of the compressed output compared to our delta encoding approach.

Shall we modify the LZ algorithm a little, so that it works best for our case?

Approach-1
---------------
Is it possible to increase the control data from 1 bit to 2 bits [0 - new
data, 1 - pick from history based on OFFSET-LENGTH, 2 - length and new data]?
The new control value (2) handles new field data as a continuous stream of
data, instead of treating every byte as new data.

Approach-2
---------------
Use only one bit of control data [0 - length and new data, 1 - pick from
history based on OFFSET-LENGTH].
The modified bit value (0) handles new field data as a continuous stream of
data, instead of treating every byte as new data.
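
To make Approach-2 concrete, a rough sketch of how the two item kinds could be
emitted (DeltaOut and the delta_put_* helpers are hypothetical, and the 2-byte
offsets/lengths are only an assumption for illustration):

/* Item with control bit 0: LENGTH followed by LENGTH bytes of new data. */
static void
emit_new_data(DeltaOut *out, const char *data, uint16 len)
{
	delta_put_ctrl_bit(out, 0);
	delta_put_uint16(out, len);
	delta_put_bytes(out, data, len);
}

/* Item with control bit 1: copy LENGTH bytes from OFFSET in the old tuple. */
static void
emit_history_copy(DeltaOut *out, uint16 offset, uint16 len)
{
	delta_put_ctrl_bit(out, 1);
	delta_put_uint16(out, offset);
	delta_put_uint16(out, len);
}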

With Regards,
Amit Kapila.

#15Amit Kapila
amit.kapila@huawei.com
In reply to: Noah Misch (#12)
Re: Performance Improvement by reducing WAL for Update Operation

On Saturday, October 27, 2012 11:34 PM Noah Misch wrote:

On Sat, Oct 27, 2012 at 04:57:46PM +0530, Amit Kapila wrote:

On Saturday, October 27, 2012 4:03 AM Noah Misch wrote:

Could you elaborate on your reason for continuing to treat TOAST as a special
case? As best I recall, the only reason to do so before was the fact that
TOAST can change the physical representation of a column even when the
executor did not change its logical content. Since you're no longer relying on
the executor's opinion of what changed, a TOASTed value is not special.

I thought that for the initial version of the patch, without this change, the
patch would have less impact and need less testing.

Not that I'm aware. If you still think so, please explain.

For this patch, I would like to stick with the delta encoding approach based
on column boundaries.

Fair enough.

If you conclude that finding sub-column similarity is not worthwhile, at least
teach your algorithm to aggregate runs of changing or unchanging columns into
fewer delta instructions. If a table contains twenty unchanging bool columns,
you currently use at least 80 bytes to encode that fact. By treating the run
of columns as a unit for delta encoding purposes, you could encode it in 23
bytes.

Do you mean handling non-contiguous unchanged columns?

My statement above was a mess.

I believe contiguous unchanged columns are already handled, unless there are
alignment changes. Example:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
                 f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
                 f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
                 f20 bool, f21 bool);

insert into tbl values(10,
    't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t');

update tbl set f1 = 20;

The delta algorithm for the above operation reduced the size of the tuple
from 24 bytes to 12 bytes.

4 bytes - IGN command and LEN
4 bytes - ADD command and LEN
4 bytes - Data block

I now see that this case is already handled. Sorry for the noise.
Incidentally, I tried this variant:

create table tbl(f1 int, f2 bool, f3 bool, f4 bool, f5 bool, f6 bool, f7 bool,
                 f8 bool, f9 bool, f10 bool, f11 bool, f12 bool, f13 bool,
                 f14 bool, f15 bool, f16 bool, f17 bool, f18 bool, f19 bool,
                 f20 bool, f21 bool, f22 int, f23 int);
insert into tbl values(1,
    't','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t','t', 2, 3);
update tbl set f1 = 2, f22 = 4, f23 = 6;

It yielded an erroneous delta: IGN 4, ADD 4, COPY 24, IGN 4, ADD 4, COPY 28,
IGN 4, ADD 4. (The delta happens to be longer than the data and goes unused).

I think that with a new algorithm based on your inputs, this case will be
handled in a much better way.
I am planning to try two approaches:
1. Try to use LZ compression in the manner suggested by Heikki; if it works,
it can be simpler.
2. Devise a new algorithm based on your suggestions, referring to the LZ and
VCDIFF algorithms.

With Regards,
Amit Kapila.