Re: Performance Improvement by reducing WAL for Update Operation

Started by Amit Kapila · about 13 years ago · 128 messages · pgsql-hackers
#1 Amit Kapila
amit.kapila16@gmail.com

On Friday, January 11, 2013 11:12 PM Simon Riggs wrote:
On 11 January 2013 17:30, Amit kapila <amit(dot)kapila(at)huawei(dot)com> wrote:

On Friday, January 11, 2013 7:59 PM Alvaro Herrera wrote:
Simon Riggs wrote:

On 28 December 2012 10:21, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

I was also worried about the high variance in the results. Those
averages look rather meaningless. Which would be okay, I think, because
it'd mean that performance-wise the patch is a wash,

For larger tuple sizes (>1000 && < 1800), the performance gain will be good.
Please refer to the performance results from me and Kyotaro-san at the links below:

http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C383BEAAE32(at)szxeml509-mbx
http://archives.postgresql.org/message-id/20121228(dot)170748(dot)90887322(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp

AFAICS your tests are badly variable, but as Alvaro says, they aren't
accurate enough to tell there's a regression.

Running the performance scenario on a SUSE 11 machine, the readings do not vary much except at 8 threads; I believe this is because the machine has only 4 cores.

Performance readings are attached for the original schema and for 256-, 512-, 1000- and 1800-byte record sizes.

Conclusions from the readings:

1. With the original pgbench there is a max 9% WAL reduction with not much performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 30% with not much performance difference.
3. With 500-byte and above record size in pgbench there is an improvement in both the performance and WAL reduction.

If the record size increases there is a gain in performance, and WAL size is reduced as well.

With Regards,

Amit Kapila.

Attachments:

pgbench_suse11.htm (text/html)
#2 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#1)

On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

I've moved this to the next CF. I'm planning to review this one first.

Thank you.

Just reviewing the patch now, making more sense with comments added.

Making more sense, but not yet making complete sense.

I'd like you to revisit the patch comments since some of them are completely unreadable.

I have modified most of the comments in code.

The changes in attached patch are as below:

1. Introduced Encoded WAL Tuple (EWT) to refer to the delta-encoded tuple for the update operation.

It could be renamed to one of the below:

a. WAL Encoded Tuple (WET)

b. Delta Encoded WAL Tuple (DEWT)

c. Delta WAL Encoded Tuple (DWET)

d. any others?

2. I have kept the wording related to compression in the modified docs, but I have tried to copy the relevant parts completely.

IMO this is required as there are some changes w.r.t. LZ compression, such as for New Data.

3. There is a small coding change, as the earlier one had been overwritten by one of my previous patches:

Calculation of the approximate length of the encoded WAL tuple.

Previous patch:

    if ((bp + (2 * new_tup_bitmaplen)) - bstart >= result_max)

New patch:

    if ((bp + (2 + new_tup_bitmaplen)) - bstart >= result_max)

The previous calculation would have been valid only if we had used the LZ format exactly.

With Regards,

Amit Kapila.

Attachments:

wal_update_changes_v8.patch (application/octet-stream, +1320 -347)
#3 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#2)

On Monday, January 21, 2013 9:32 PM Amit kapila wrote:
On 11 January 2013 15:57, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

I've moved this to the next CF. I'm planning to review this one first.

Thank you.

Just reviewing the patch now, making more sense with comments added.

Making more sense, but not yet making complete sense.

I'd like you to revisit the patch comments since some of them are completely unreadable.

I have modified most of the comments in code.
The changes in attached patch are as below:

Rebased the patch as per HEAD.
With Regards,
Amit Kapila.

Attachments:

wal_update_changes_v9.patch (application/octet-stream, +1310 -338)
#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#3)

On 28.01.2013 15:39, Amit Kapila wrote:

Rebased the patch as per HEAD.

I don't like the way heap_delta_encode has intimate knowledge of how the
lz compression works. It feels like a violent punch through the
abstraction layers.

Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need to
completely blow up the abstraction. pglz can still do the encoding - the
caller just needs to pass it the attribute boundaries to consider for
matches, so that it doesn't need to scan them byte by byte.

I came up with the attached patch. I wrote it to demonstrate the API,
I'm not 100% sure the result after decoding is correct.

- Heikki

Attachments:

wal_update_pglz_with_history-heikki.patch (text/x-diff, +620 -38)
#5 Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#4)

On Tuesday, January 29, 2013 2:53 AM Heikki Linnakangas wrote:

On 28.01.2013 15:39, Amit Kapila wrote:

Rebased the patch as per HEAD.

I don't like the way heap_delta_encode has intimate knowledge of how
the lz compression works. It feels like a violent punch through the
abstraction layers.

Ideally, you would just pass the old and new tuple to pglz as char *,
and pglz code would find the common parts. But I guess that's too slow,
as that's what I originally suggested and you rejected that approach.
But even if that's not possible on performance grounds, we don't need
to completely blow up the abstraction. pglz can still do the encoding -
the caller just needs to pass it the attribute boundaries to consider
for matches, so that it doesn't need to scan them byte by byte.

I came up with the attached patch. I wrote it to demonstrate the API,
I'm not 100% sure the result after decoding is correct.

I have checked the patch code and found a few problems.

1. The history should be the old tuple, so the call below needs to be changed:

    /*
    return pglz_compress_with_history((char *) oldtup->t_data, oldtup->t_len,
                                      (char *) newtup->t_data, newtup->t_len,
                                      offsets, noffsets, (PGLZ_Header *) encdata,
                                      &strategy);
    */
    return pglz_compress_with_history((char *) newtup->t_data, newtup->t_len,
                                      (char *) oldtup->t_data, oldtup->t_len,
                                      offsets, noffsets, (PGLZ_Header *) encdata,
                                      &strategy);
2. The offsets array should hold the beginning offset of each column, so the assignment below needs to move before the length is added:

        offsets[noffsets++] = off;

        off = att_addlength_pointer(off, thisatt->attlen, tp + off);

        if (thisatt->attlen <= 0)
            slow = true;        /* can't use attcacheoff anymore */

        // offsets[noffsets++] = off;
    }

Apart from this, some of the test cases are failing which I need to check.

I have debugged the new code; it appears to me that this will not be as efficient as the current approach of the patch. It needs to build a hash table for history references and comparisons, which can add overhead compared to the existing approach. I am collecting the performance and WAL-reduction data.

Can the current patch code be made better in another way, so that we don't need to change the encoding approach? I have a feeling that this approach might not be equally good performance-wise.

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#5)

On 29.01.2013 11:58, Amit Kapila wrote:

Can there be another way with which current patch code can be made better,
so that we don't need to change the encoding approach, as I am having
feeling that this might not be performance wise equally good.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch behave more like yours by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

- Heikki


#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#6)

On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:

On 29.01.2013 11:58, Amit Kapila wrote:

Can there be another way with which current patch code can be made better, so that we don't need to change the encoding approach, as I am having a feeling that this might not be performance-wise equally good.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch more like yours in behavior by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

I think it makes sense: if we have the offsets of both the new and old tuples, we can internally use memcmp to compare columns and use the same algorithm for encoding.
I will change the patch according to this suggestion.

With Regards,
Amit Kapila.


#8 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#7)

On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:

On 29.01.2013 11:58, Amit Kapila wrote:

Can there be another way with which current patch code can be made better, so that we don't need to change the encoding approach, as I am having a feeling that this might not be performance-wise equally good.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch more like yours in behavior by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

I think it makes sense, because if we have offsets of both new and old tuples, we can internally use memcmp to compare columns and use the same algorithm for encoding.
I will change the patch according to this suggestion.

I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have also passed the bitmap length, as we need to copy the bitmap of the new tuple as-is into the Encoded WAL Tuple.

Please see if such an API design is okay.

I shall update the README and send the performance/WAL-reduction data for the modified patch tomorrow.

With Regards,
Amit Kapila.

Attachments:

wal_update_changes_v10.patch (application/octet-stream, +745 -121)
#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#8)

On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:

On 29.01.2013 11:58, Amit Kapila wrote:

Can there be another way with which current patch code can be made better, so that we don't need to change the encoding approach, as I am having a feeling that this might not be performance-wise equally good.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch more like yours in behavior by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

I think it makes sense, because if we have offsets of both new and old tuples, we can internally use memcmp to compare columns and use the same algorithm for encoding.
I will change the patch according to this suggestion.

I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have also passed the bitmap length, as we need to copy the bitmap of the new tuple as-is into the Encoded WAL Tuple.

Please see if such an API design is okay.

I shall update the README and send the performance/WAL-reduction data for the modified patch tomorrow.

Updated patch, including comments and README, is attached with this mail.
This patch contains exactly the same design and behavior as the previous one, and takes care of Heikki's API design suggestion.

The performance data is similar; as it is not complete, I shall send it tomorrow.

With Regards,
Amit Kapila.

Attachments:

wal_update_changes_v10.patch (application/octet-stream, +907 -144)
#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)

On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:

On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:

On 29.01.2013 11:58, Amit Kapila wrote:

Can there be another way with which current patch code can be made better, so that we don't need to change the encoding approach, as I am having a feeling that this might not be performance-wise equally good.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch more like yours in behavior by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

I think it makes sense, because if we have offsets of both new and old tuples, we can internally use memcmp to compare columns and use the same algorithm for encoding.
I will change the patch according to this suggestion.

I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have also passed the bitmap length, as we need to copy the bitmap of the new tuple as-is into the Encoded WAL Tuple.

Please see if such an API design is okay.

I shall update the README and send the performance/WAL-reduction data for the modified patch tomorrow.

Updated patch including comments and README is attached with this mail.
This patch contains exactly the same design and behavior as the previous one, and takes care of Heikki's API design suggestion.

The performance data is similar; as it is not complete, I shall send it tomorrow.

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):

1. With the original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500-byte and above record size in pgbench there is an improvement in both the performance and WAL reduction.

If the record size increases there is a gain in performance, and WAL size is reduced as well.

Performance data for synchronous_commit = on is under progress; I shall post it once it is done.
I am expecting it to be the same as the previous results.

With Regards,
Amit Kapila.

Attachments:

pgbench_wal_lz_mod.htm (text/html)
#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#10)

On Friday, February 01, 2013 6:37 PM Amit Kapila wrote:

On Thursday, January 31, 2013 6:44 PM Amit Kapila wrote:

On Wednesday, January 30, 2013 8:32 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 7:42 PM Amit Kapila wrote:

On Tuesday, January 29, 2013 3:53 PM Heikki Linnakangas wrote:

On 29.01.2013 11:58, Amit Kapila wrote:

Can there be another way with which current patch code can be made better, so that we don't need to change the encoding approach, as I am having a feeling that this might not be performance-wise equally good.

The point is that I don't want heap_delta_encode() to know the internals of pglz compression. You could probably make my patch more like yours in behavior by also passing an array of offsets in the new tuple to check, and only checking for matches at those offsets.

I think it makes sense, because if we have offsets of both new and old tuples, we can internally use memcmp to compare columns and use the same algorithm for encoding.
I will change the patch according to this suggestion.

I have modified the patch as per the above suggestion.
Apart from passing the new and old tuple offsets, I have also passed the bitmap length, as we need to copy the bitmap of the new tuple as-is into the Encoded WAL Tuple.

Please see if such an API design is okay.

I shall update the README and send the performance/WAL-reduction data for the modified patch tomorrow.

Updated patch including comments and README is attached with this mail.
This patch contains exactly the same design and behavior as the previous one, and takes care of Heikki's API design suggestion.

The performance data is similar; as it is not complete, I shall send it tomorrow.

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are the same as for my previous patch):

1. With the original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500-byte and above record size in pgbench there is an improvement in both the performance and WAL reduction.

If the record size increases there is a gain in performance, and WAL size is reduced as well.

Performance data for synchronous_commit = on is under progress; I shall post it once it is done.
I am expecting it to be the same as previous.

Please find the performance readings for synchronous_commit = on.

Each run is taken for 20 minutes.

Conclusions from the readings with synchronous_commit = on:

1. With the original pgbench there is a max 2% WAL reduction with not much performance difference.
2. With 500-byte record pgbench there is a max WAL reduction of 3% with not much performance difference.
3. With 1800-byte record size in pgbench there is both an improvement in the performance (approx 3%) and a WAL reduction (44%).

If the record size increases there is a very good reduction in WAL size.

Please provide your feedback.

With Regards,
Amit Kapila.

Attachments:

pgbench_wal_lz_mod_sync_commit_on_20min.htm (text/html)
#12 Craig Ringer
craig@2ndquadrant.com
In reply to: Amit Kapila (#11)

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous patch):

1. With the original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500-byte and above record size in pgbench there is an improvement in both the performance and WAL reduction.

If the record size increases there is a gain in performance, and WAL size is reduced as well.

Performance data for synchronous_commit = on is under progress; I shall post it once it is done.
I am expecting it to be the same as previous.

Please find the performance readings for synchronous_commit = on.

Each run is taken for 20 min.

Conclusions from the readings with synchronous commit on mode:

1. With the original pgbench there is a max 2% WAL reduction with not much performance difference.
2. With 500-byte record pgbench there is a max WAL reduction of 3% with not much performance difference.
3. With 1800-byte record size in pgbench there is both an improvement in the performance (approx 3%) as well as a WAL reduction (44%).

If the record size increases there is a very good reduction in WAL size.

The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Craig Ringer (#12)

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous patch):

1. With the original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500-byte and above record size in pgbench there is an improvement in both the performance and WAL reduction.

If the record size increases there is a gain in performance, and WAL size is reduced as well.

Performance data for synchronous_commit = on is under progress; I shall post it once it is done.
I am expecting it to be the same as previous.

Please find the performance readings for synchronous_commit = on.

Each run is taken for 20 min.

Conclusions from the readings with synchronous commit on mode:

1. With the original pgbench there is a max 2% WAL reduction with not much performance difference.
2. With 500-byte record pgbench there is a max WAL reduction of 3% with not much performance difference.
3. With 1800-byte record size in pgbench there is both an improvement in the performance (approx 3%) as well as a WAL reduction (44%).

If the record size increases there is a very good reduction in WAL size.

The stats look fairly sane. I'm a little concerned about the apparent trend of falling TPS in the patched vs original tests for the 1-client test as record size increases, but it's only 0.0%->0.2%->0.4%, and the 0.4% case made other config changes too. Nonetheless, it might be wise to check with really big records and see if the trend continues.

For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do this optimization.
This optimization is mainly for medium-size records.

With Regards,
Amit Kapila.


#14 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Amit Kapila (#13)

On 04.03.2013 06:39, Amit Kapila wrote:

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous patch):

1. With the original pgbench there is a max 7% WAL reduction with not much performance difference.
2. With 250-byte record pgbench there is a max WAL reduction of 35% with not much performance difference.
3. With 500-byte and above record size in pgbench there is an improvement in both the performance and WAL reduction.

If the record size increases there is a gain in performance, and WAL size is reduced as well.

Performance data for synchronous_commit = on is under progress; I shall post it once it is done.
I am expecting it to be the same as previous.

Please find the performance readings for synchronous_commit = on.

Each run is taken for 20 min.

Conclusions from the readings with synchronous commit on mode:

1. With the original pgbench there is a max 2% WAL reduction with not much performance difference.
2. With 500-byte record pgbench there is a max WAL reduction of 3% with not much performance difference.
3. With 1800-byte record size in pgbench there is both an improvement in the performance (approx 3%) as well as a WAL reduction (44%).

If the record size increases there is a very good reduction in WAL size.

The stats look fairly sane. I'm a little concerned about the apparent trend of falling TPS in the patched vs original tests for the 1-client test as record size increases, but it's only 0.0%->0.2%->0.4%, and the 0.4% case made other config changes too. Nonetheless, it might be wise to check with really big records and see if the trend continues.

For bigger records (~2000 bytes), the data goes into TOAST, for which we don't do this optimization.
This optimization is mainly for medium-size records.

I've been investigating the pglz option further, and doing performance comparisons of the pglz approach and this patch. I'll begin with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1245525360 | 9.94613695144653
two short fields, one changed            |    1245536528 | 10.146910905838
two short fields, both changed           |    1245523160 | 11.2332470417023
one short and one long field, no change  |    1054926504 | 5.90477800369263
ten tiny fields, all changed             |    1411774608 | 13.4536008834839
hundred tiny fields, all changed         |     635739680 | 7.57448387145996
hundred tiny fields, half changed        |     636930560 | 7.56888699531555
hundred tiny fields, half nulled         |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1249722112 | 13.0558869838715
two short fields, one changed            |    1246145408 | 12.9947438240051
two short fields, both changed           |    1245951056 | 13.0262880325317
one short and one long field, no change  |     678480664 | 5.70031690597534
ten tiny fields, all changed             |    1328873920 | 20.0167419910431
hundred tiny fields, all changed         |     638149416 | 14.4236788749695
hundred tiny fields, half changed        |     635560504 | 14.8770561218262
hundred tiny fields, half nulled         |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1245519008 | 11.6702048778534
two short fields, one changed            |    1245756904 | 11.3233819007874
two short fields, both changed           |    1249711088 | 11.6836447715759
one short and one long field, no change  |     664741392 | 6.44810795783997
ten tiny fields, all changed             |    1328085568 | 13.9679481983185
hundred tiny fields, all changed         |     635974088 | 9.15514206886292
hundred tiny fields, half changed        |     636309040 | 9.13769292831421
hundred tiny fields, half nulled         |     496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE
is timed. Duration is the time spent in the UPDATE (lower is better),
and wal_generated is the amount of WAL generated by the updates (lower
is better).

The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above.

The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default, this
probably just isn't worth it.

The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function, it
goes further than that, and contains some further micro-optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for speed.

If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to be
brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.

PS. I haven't done much testing of WAL redo, so it's quite possible that
the encoding is actually buggy, or that decoding is slow. But I don't
think there's anything so fundamentally wrong that it would affect the
performance results much.

- Heikki

Attachments:

pglz-with-micro-optimizations-1.patch (text/x-diff, +755 -58)
wal-update-testsuite.sh (application/x-sh)
#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#14)

On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:

On 04.03.2013 06:39, Amit Kapila wrote:

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous patch):

I've been investigating the pglz option further, and doing performance comparisons of the pglz approach and this patch. I'll begin with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1245525360 | 9.94613695144653
two short fields, one changed            |    1245536528 | 10.146910905838
two short fields, both changed           |    1245523160 | 11.2332470417023
one short and one long field, no change  |    1054926504 | 5.90477800369263
ten tiny fields, all changed             |    1411774608 | 13.4536008834839
hundred tiny fields, all changed         |     635739680 | 7.57448387145996
hundred tiny fields, half changed        |     636930560 | 7.56888699531555
hundred tiny fields, half nulled         |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1249722112 | 13.0558869838715
two short fields, one changed            |    1246145408 | 12.9947438240051
two short fields, both changed           |    1245951056 | 13.0262880325317
one short and one long field, no change  |     678480664 | 5.70031690597534
ten tiny fields, all changed             |    1328873920 | 20.0167419910431
hundred tiny fields, all changed         |     638149416 | 14.4236788749695
hundred tiny fields, half changed        |     635560504 | 14.8770561218262
hundred tiny fields, half nulled         |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1245519008 | 11.6702048778534
two short fields, one changed            |    1245756904 | 11.3233819007874
two short fields, both changed           |    1249711088 | 11.6836447715759
one short and one long field, no change  |     664741392 | 6.44810795783997
ten tiny fields, all changed             |    1328085568 | 13.9679481983185
hundred tiny fields, all changed         |     635974088 | 9.15514206886292
hundred tiny fields, half changed        |     636309040 | 9.13769292831421
hundred tiny fields, half nulled         |     496396448 | 8.77351498603821

For some of the tests, it doesn't even execute the main part of the
compression/encoding. The reason is that the tuple length is less than the
strategy's minimum length, so pglz_delta_encode() returns early from the
check below:

    if (strategy->match_size_good <= 0 ||
        slen < strategy->min_input_size ||
        slen > strategy->max_input_size)
        return false;

The tests for which it doesn't execute encoding are below:
two short fields, no change
two short fields, one changed
two short fields, both changed
ten tiny fields, all changed

For the above cases, the reason both approaches differ in timing from the
original could be that this check is performed only after some processing.
So I think if we check the length in log_heap_update instead, there should
be no timing difference for the above test scenarios. I can check that.

This optimization helps only when the tuple length is greater than roughly
128~200 bytes and up to 1800 bytes (until the tuple is toasted); otherwise
it can add overhead without any major WAL reduction.
In fact, one of my initial patches performed the optimization only if the
tuple length was greater than 128 bytes.

I shall try running both patches for cases where the tuple length is
> 128~200 bytes, as this optimization benefits those cases.

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the
UPDATE is timed. Duration is the time spent in the UPDATE (lower is
better), and wal_generated is the amount of WAL generated by the
updates (lower is better).

The summary is that Amit's patch is a small win in terms of CPU usage,
in the best case where the table has few columns, with one large column
that is not updated. In all other cases it just adds overhead. In terms
of WAL size, you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz
algorithm to spot the similarities between the old and new tuple,
instead of having explicit knowledge of where the column boundaries
are.
This has the advantage that it will spot similarities, and be able to
compress, in more cases. For example, you can see a reduction in WAL
size in the "hundred tiny fields, half nulled" test case above.

The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.

As I mentioned, for smaller tuples this can add overhead without any major
WAL-reduction benefit, so I think before encoding it should ensure that the
tuple length is greater than some threshold.
Yes, it can miss some cases, as your test has shown for "hundred tiny
fields, half nulled", but we might then be able to safely enable it by
default.

The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more.

One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.

Do you mean to roll the hash for 10 bytes and only then call
pglz_hist_add_no_recycle, and do the same before pglz_find_match?

I shall try doing this for the tests.

If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to
be brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.

Agreed. Though it gives a benefit in some cases, it should not degrade much
in any other case.

One more thing: any compression technique has some overhead, so it should
be used selectively rather than in every case. In that regard, I think we
should do this optimization only when it has a good chance of winning
(e.g., based on the tuple length or some other criterion; otherwise the
WAL tuple can be logged as-is). What is your opinion?

PS. I haven't done much testing of WAL redo, so it's quite possible
that the encoding is actually buggy, or that decoding is slow. But I
don't think there's anything so fundamentally wrong that it would
affect the performance results much.

I also don't think it will have any problem, but I can run some tests to
verify.

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#14)

On 2013-03-05 23:26:59 +0200, Heikki Linnakangas wrote:

On 04.03.2013 06:39, Amit Kapila wrote:

The stats look fairly sane. I'm a little concerned about the apparent
trend of falling TPS in the patched vs original tests for the 1-client
test as record size increases, but it's only 0.0%->0.2%->0.4%, and the
0.4% case made other config changes too. Nonetheless, it might be wise
to check with really big records and see if the trend continues.

For bigger records (~2000 bytes), the tuple goes to TOAST, for which we
don't do this optimization.
This optimization is mainly for medium-size records.

I've been investigating the pglz option further, and doing performance
comparisons of the pglz approach and this patch. I'll begin with some
numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 1245525360 | 9.94613695144653
two short fields, one changed | 1245536528 | 10.146910905838
two short fields, both changed | 1245523160 | 11.2332470417023
one short and one long field, no change | 1054926504 | 5.90477800369263
ten tiny fields, all changed | 1411774608 | 13.4536008834839
hundred tiny fields, all changed | 635739680 | 7.57448387145996
hundred tiny fields, half changed | 636930560 | 7.56888699531555
hundred tiny fields, half nulled | 573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

testname | wal_generated | duration

-----------------------------------------+---------------+------------------
two short fields, no change | 1249722112 | 13.0558869838715
two short fields, one changed | 1246145408 | 12.9947438240051
two short fields, both changed | 1245951056 | 13.0262880325317
one short and one long field, no change | 678480664 | 5.70031690597534
ten tiny fields, all changed | 1328873920 | 20.0167419910431
hundred tiny fields, all changed | 638149416 | 14.4236788749695
hundred tiny fields, half changed | 635560504 | 14.8770561218262
hundred tiny fields, half nulled | 558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1245519008 | 11.6702048778534
two short fields, one changed | 1245756904 | 11.3233819007874
two short fields, both changed | 1249711088 | 11.6836447715759
one short and one long field, no change | 664741392 | 6.44810795783997
ten tiny fields, all changed | 1328085568 | 13.9679481983185
hundred tiny fields, all changed | 635974088 | 9.15514206886292
hundred tiny fields, half changed | 636309040 | 9.13769292831421
hundred tiny fields, half nulled | 496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows, and
fillfactor=50. Then a full-table UPDATE is performed, and the UPDATE is
timed. Duration is the time spent in the UPDATE (lower is better), and
wal_generated is the amount of WAL generated by the updates (lower is
better).

The summary is that Amit's patch is a small win in terms of CPU usage, in
the best case where the table has few columns, with one large column that is
not updated. In all other cases it just adds overhead. In terms of WAL size,
you get a big gain in the same best case scenario.

Attached is a different version of this patch, which uses the pglz algorithm
to spot the similarities between the old and new tuple, instead of having
explicit knowledge of where the column boundaries are. This has the
advantage that it will spot similarities, and be able to compress, in more
cases. For example, you can see a reduction in WAL size in the "hundred tiny
fields, half nulled" test case above.

The attached patch also just adds overhead in most cases, but the overhead
is much smaller in the worst case. I think that's the right tradeoff here -
we want to avoid scenarios where performance falls off the cliff. That said,
if you usually just get a slowdown, we certainly can't make this the
default, and if we can't turn it on by default, this probably just isn't
worth it.

The attached patch contains the variable-hash-size changes I posted in the
"Optimizing pglz compressor". But in the delta encoding function, it goes
further than that, and contains some further micro-optimizations: the hash
is calculated in a rolling fashion, and it uses a specialized version of the
pglz_hist_add macro that knows that the input can't exceed 4096 bytes. Those
changes shaved off some cycles, but you could probably do more. One idea is
to only add every 10 bytes or so to the history lookup table; that would
sacrifice some compressibility for speed.

If you could squeeze pglz_delta_encode function to be cheap enough that we
could enable this by default, this would be pretty cool patch. Or at least,
the overhead in the cases that you get no compression needs to be brought
down, to about 2-5 % at most I think. If it can't be done easily, I feel
that this probably needs to be dropped.

While this is exciting stuff - and I find Heikki's approach more
interesting and applicable to more cases - I think this is clearly not
9.3 material anymore. There are loads of tradeoffs here which require
a substantial amount of benchmarking, and it's not the kind of change
that can be backed out easily during 9.3's lifecycle.

And I have to say I find 2-5% performance overhead too high...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#14)

On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:

On 04.03.2013 06:39, Amit Kapila wrote:

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous

patch):

I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll begin
with some numbers:

unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1245525360 | 9.94613695144653
two short fields, one changed            |    1245536528 | 10.146910905838
two short fields, both changed           |    1245523160 | 11.2332470417023
one short and one long field, no change  |    1054926504 | 5.90477800369263
ten tiny fields, all changed             |    1411774608 | 13.4536008834839
hundred tiny fields, all changed         |     635739680 | 7.57448387145996
hundred tiny fields, half changed        |     636930560 | 7.56888699531555
hundred tiny fields, half nulled         |     573751120 | 6.68991994857788

Amit's wal_update_changes_v10.patch:

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1249722112 | 13.0558869838715
two short fields, one changed            |    1246145408 | 12.9947438240051
two short fields, both changed           |    1245951056 | 13.0262880325317
one short and one long field, no change  |     678480664 | 5.70031690597534
ten tiny fields, all changed             |    1328873920 | 20.0167419910431
hundred tiny fields, all changed         |     638149416 | 14.4236788749695
hundred tiny fields, half changed        |     635560504 | 14.8770561218262
hundred tiny fields, half nulled         |     558468352 | 16.2437210083008

pglz-with-micro-optimizations-1.patch:

testname                                 | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change              |    1245519008 | 11.6702048778534
two short fields, one changed            |    1245756904 | 11.3233819007874
two short fields, both changed           |    1249711088 | 11.6836447715759
one short and one long field, no change  |     664741392 | 6.44810795783997
ten tiny fields, all changed             |    1328085568 | 13.9679481983185
hundred tiny fields, all changed         |     635974088 | 9.15514206886292
hundred tiny fields, half changed        |     636309040 | 9.13769292831421
hundred tiny fields, half nulled         |     496396448 | 8.77351498603821

In each test, a table is created with a large number of identical rows,
and fillfactor=50. Then a full-table UPDATE is performed, and the
UPDATE is timed. Duration is the time spent in the UPDATE (lower is
better), and wal_generated is the amount of WAL generated by the
updates (lower is better).

Based on your patch, I have tried some more optimizations:

Fixed a bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to the wrong length of the old
tuple passed to decode, which I have corrected.

Approach-1 (pglz-with-micro-optimizations-2_roll10_32)
1. Moved the strategy min length (32) check into log_heap_update.
2. Added the rolling-10 hash, as suggested by you.

Approach-2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. This is done on top of the Approach-1 changes.
2. Used 1 byte of data as the hash key.

Approach-3
(pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. This is done on top of the Approach-1 and Approach-2 changes.
2. Instead of copying each literal byte as it fails to match the history,
copy all such bytes in a batch.

Data for all the above approaches is in the attached file "test_readings".
(Apart from your tests, I have added one more test: "hundred tiny fields,
first 10 changed".)

Summary -
After the Approach-1 changes, CPU utilization for all tests except 2
("hundred tiny fields, all changed", "hundred tiny fields, half changed")
is the same or lower. Best-case CPU utilization has decreased (which is
better), but WAL has increased a little bit (which is as per expectation,
due to the 10-byte consecutive rolling).

Approach-2 was done to see if there is any overhead in the hash
calculation.
Approach-2 and Approach-3 don't result in any improvement.

I have investigated the reason for the CPU utilization in those 2 tests:
there is nothing to compress in the new tuple, and that is only discovered
after processing 75% (the compression ratio) of the tuple bytes.
I think any compression algorithm has this drawback: if the data is not
compressible, it can consume time in spite of not being able to compress
it.
I think most updates modify only part of the tuple, which will always yield
positive results.

Apart from the above tests, I have run your patch against my old tests; it
yields quite positive results. WAL reduction is greater than with my patch,
and CPU utilization is almost the same or slightly better with my patch.
The results are in the attached file "pgbench_pg_lz_mod".

All the above data is for synchronous_commit = off. I can collect data for
synchronous_commit = on, and for recovery performance.

Any further suggestions?

With Regards,
Amit Kapila.

Attachments:

pgbench_pg_lz_mod.htm (text/html)
pglz-with-micro-optimizations-2.patch (application/octet-stream, +756 -58)
pglz-with-micro-optimizations-2_roll10_32.patch (application/octet-stream, +784 -61)
pglz-with-micro-optimizations-2_roll10_32_1hashkey.patch (application/octet-stream, +761 -61)
pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal.patch (application/octet-stream, +759 -61)
test_readings.txt (text/plain)
#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#17)

On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:

On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:

On 04.03.2013 06:39, Amit Kapila wrote:

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous

patch):

I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll
begin with some numbers:

Based on your patch, I have tried some more optimizations:

Fixed bug in your patch (pglz-with-micro-optimizations-2):
1. There were some problems in recovery due to wrong length of oldtuple
passed in decode which I have corrected.

Approach -1 (pglz-with-micro-optimizations-2_roll10_32)
1. Move strategy min length (32) check in log_heap_update 2. Rolling 10
for hash as suggested by you is added.

Approach -2 (pglz-with-micro-optimizations-2_roll10_32_1hashkey)
1. This is done on top of Approach-1 changes 2. Used 1 byte data as the
hash key.

Approach-3
(pglz-with-micro-optimizations-2_roll10_32_1hashkey_batch_literal)
1. This is done on top of Approach-1 and Approach-2 changes 2. Instead
of doing copy of literal byte when it founds as non match with history,
do all in a batch.

Data for all above approaches is in attached file "test_readings"
(Apart from your tests, I have added one more test " hundred tiny
fields, first 10
changed")

Summary -
After changes of Approach-1, CPU utilization for all except 2 tests
("hundred tiny fields, all changed", "hundred tiny fields, half
changed") is either same or less. The best case CPU utilization has
decreased (which is better), but WAL reduction has little bit increased
(which is as per expectation due 10 consecutive rollup's).

Approach-2 modifications was done to see if there is any overhead of
hash calculation.
Approach-2 & Approach-3 doesn't result into any improvements.

I have investigated the reason for CPU utilization for 2 tests and the
reason is that there is nothing to compress in the new tuple and that
information it will come to know only after it processes 75%
(compression
ratio) of tuple bytes.
I think any compression algorithm will have this drawback that if data
is not compressible, it can consume time inspite of the fact that it
will not be able to compress the data.
I think most updates will update some part of tuple which will always
yield positive results.

Apart from above tests, I have run your patch against my old tests, it
yields quite positive results, WAL Reduction is more as compare to my
patch and CPU utilization is almost similar or my patch is slightly
better.
The results are in attached file "pgbench_pg_lz_mod"

The above all data is for synchronous_commit = off. I can collect the
data for synchronous_commit = on and Performance of recovery.

Data for synchronous_commit = on is as follows:

Find the data for heikki's test in file "test_readings_on.txt"

Results and observations are the same as for synchronous_commit = off. In
short, Approach-1 as described in the mail above seems to be the best.

Find the data for the pgbench-based tests used previously in
"pgbench_pg_lz_mod_sync_commit_on.htm".
This was done for Heikki's original patch and Approach-1.
It shows a very minor CPU dip (0.1%) in some cases and a WAL reduction of
2~3%.
The WAL reduction is not large because fewer operations are performed.

Recovery Performance
----------------------
pgbench org:

./pgbench -i -s 75 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres

pgbench 1800(rec size=1800):

./pgbench -i -s 10 -F 80 postgres
./pgbench -c 4 -j 4 -T 600 postgres

Recovery benchmark:

                 postgres org     postgres pglz optimization
                 Recovery (sec)   Recovery (sec)
pgbench org           11               11
pgbench 1800          16               11

There is one more defect in recovery which is fixed in attached patch
pglz-with-micro-optimizations-3.patch.
In pglz_find_match(), it was going beyond maxlen for comparision due to
which encoded data was not properly written to WAL.

Finally, as per my work further to your patch, the best patch will be by
fixing recovery defects and changes for Approach-1.

With Regards,
Amit Kapila.

Attachments:

test_readings_on.txt (text/plain)
pgbench_pg_lz_mod_sync_commit_on.htm (text/html)
pglz-with-micro-optimizations-3.patch (application/octet-stream, +755 -59)
#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#18)

On Wednesday, March 13, 2013 5:50 PM Amit Kapila wrote:

On Friday, March 08, 2013 9:22 PM Amit Kapila wrote:

On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:

On 04.03.2013 06:39, Amit Kapila wrote:

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous

patch):

I've been investigating the pglz option further, and doing
performance comparisons of the pglz approach and this patch. I'll
begin with some numbers:

Based on your patch, I have tried some more optimizations:

Based on the numbers provided by Daniel for compression methods, I tried
the Snappy algorithm for encoding, and it addresses most of your concerns
in that it does not degrade performance in the majority of cases.

postgres original:

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232916160 | 34.0338308811188
two short fields, one changed | 1232909704 | 32.8722319602966
two short fields, both changed | 1236770128 | 35.445415019989
one short and one long field, no change | 1053000144 | 23.2983899116516
ten tiny fields, all changed | 1397452584 | 40.2718069553375
hundred tiny fields, first 10 changed | 622082664 | 21.7642788887024
hundred tiny fields, all changed | 626461528 | 20.964781999588
hundred tiny fields, half changed | 621900472 | 21.6473519802094
hundred tiny fields, half nulled | 557714752 | 19.0088789463043
(9 rows)

postgres encode wal using snappy:

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 1232915128 | 34.6910920143127
two short fields, one changed | 1238902520 | 34.2287850379944
two short fields, both changed | 1233882056 | 35.3292708396912
one short and one long field, no change | 733095168 | 20.3494939804077
ten tiny fields, all changed | 1314959744 | 38.969575881958
hundred tiny fields, first 10 changed | 483275136 | 19.6973309516907
hundred tiny fields, all changed | 481755280 | 19.7665288448334
hundred tiny fields, half changed | 488693616 | 19.7246761322021
hundred tiny fields, half nulled | 483425712 | 18.6299569606781
(9 rows)

The changes call snappy compress and decompress for encoding and decoding
in the patch.
I do the encoding only for tuple lengths greater than 32, as for very small
tuples encoding might not make much sense.

On my machine, the snappy compress/decompress was causing stack corruption
in the first 4 bytes, so I applied the workaround below to proceed; I am
looking into the reason.
1. snappy_compress - advance the encoded data buffer by 4 bytes before
encoding starts.
2. snappy_uncompress - undo the 4-byte adjustment made during compress.
3. snappy_uncompressed_length - undo the 4-byte adjustment made during
compress.

For the LZ compression patch, there was a small problem in identifying the
max length, which I have corrected in the separate patch
'pglz-with-micro-optimizations-4.patch'.

In my opinion, this patch could go one of the following ways:
1. Use LZ compression, but provide a way for the user to avoid it in cases
where little compression is possible. I see this as viable because most
updates will change only a few columns and the rest of the data stays the
same.
2. Use the snappy APIs - does anyone know of a standard snappy library?
3. Provide multiple compression methods, so that the user can choose the
appropriate one for their workload.

Feedback?

With Regards,
Amit Kapila.

Attachments:

snappy_algo_v1.patch (application/octet-stream, +1411 -2)
wal_update_snappy_concat_oldandnew_tuple_v1.patch (application/octet-stream, +346 -73)
pglz-with-micro-optimizations-4.patch (application/octet-stream, +755 -59)
#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Heikki Linnakangas (#14)

On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:

On 04.03.2013 06:39, Amit Kapila wrote:

On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:

On 02/05/2013 11:53 PM, Amit Kapila wrote:

Performance data for the patch is attached with this mail.
Conclusions from the readings (these are same as my previous

patch):

The attached patch also just adds overhead in most cases, but the
overhead is much smaller in the worst case. I think that's the right
tradeoff here - we want to avoid scenarios where performance falls off
the cliff. That said, if you usually just get a slowdown, we certainly
can't make this the default, and if we can't turn it on by default,
this probably just isn't worth it.

The attached patch contains the variable-hash-size changes I posted in
the "Optimizing pglz compressor". But in the delta encoding function,
it goes further than that, and contains some further micro-
optimizations:
the hash is calculated in a rolling fashion, and it uses a specialized
version of the pglz_hist_add macro that knows that the input can't
exceed 4096 bytes. Those changes shaved off some cycles, but you could
probably do more. One idea is to only add every 10 bytes or so to the
history lookup table; that would sacrifice some compressibility for
speed.

If you could squeeze pglz_delta_encode function to be cheap enough that
we could enable this by default, this would be pretty cool patch. Or at
least, the overhead in the cases that you get no compression needs to
be brought down, to about 2-5 % at most I think. If it can't be done
easily, I feel that this probably needs to be dropped.

After trying some more optimizations of pglz_delta_encode(), I found that
if we also use the new data as history, the compression results and CPU
utilization are much better.

In addition to the pglz micro-optimization changes, the following changes
are made in the modified patch:

1. Unmatched new data is also added to the history, so it can be
referenced later.
2. To incorporate this change into the LZ algorithm, 1 extra control bit
is needed to indicate whether data comes from the old or the new tuple.

Performance Data
-----------------

Head code:

testname | wal_generated | duration
-----------------------------------------+---------------+------------------

two short fields, no change | 1232908016 | 36.3914430141449
two short fields, one changed | 1232904040 | 36.5231261253357
two short fields, both changed | 1235215048 | 37.7455959320068
one short and one long field, no change | 1051394568 | 24.418487071991
ten tiny fields, all changed | 1395189872 | 43.2316210269928
hundred tiny fields, first 10 changed | 622156848 | 21.9155580997467
hundred tiny fields, all changed | 625962056 | 22.3296411037445
hundred tiny fields, half changed | 621901128 | 21.3881061077118
hundred tiny fields, half nulled | 557708096 | 19.4633228778839

pglz-with-micro-optimization-compress-using-newdata-1:

testname | wal_generated | duration
-----------------------------------------+---------------+------------------

two short fields, no change | 1235992768 | 37.3365149497986
two short fields, one changed | 1240979256 | 36.897796869278
two short fields, both changed | 1236079976 | 38.4273149967194
one short and one long field, no change | 651010944 | 20.9490079879761
ten tiny fields, all changed | 1315606864 | 42.5771369934082
hundred tiny fields, first 10 changed | 459134432 | 17.4556930065155
hundred tiny fields, all changed | 456506680 | 17.8865270614624
hundred tiny fields, half changed | 454784456 | 18.0130441188812
hundred tiny fields, half nulled | 486675784 | 18.6600229740143

Observations
---------------
1. It yields compression in more cases (refer to all the "hundred tiny
fields" cases).
2. CPU utilization is also better.

Performance data for the pgbench-related scenarios is attached in the
document (pgbench_lz_opt_compress_using_newdata.htm):

1. Better reduction in WAL.
2. A TPS increase can be observed once the record size is >= 250.
3. There is a small performance penalty for a single thread (0.04~3.45%),
but where the single-thread penalty is 3.45%, the TPS improvement for 8
threads is high.

Do you think this matches the conditions you had in mind for this patch to
proceed further?

Thanks to Hari Babu for helping with the implementation of this idea and
collecting the performance data.

With Regards,
Amit Kapila.

Attachments:

pglz-with-micro-optimization-compress-using-newdata-1.patch (application/octet-stream, +832 -58)
pgbench_lz_opt_compress_using_newdata.htm (text/html)