Re: Performance Improvement by reducing WAL for Update Operation

Started by Amit Kapila, over 13 years ago; 41 messages; list: hackers
#1Amit Kapila
amit.kapila16@gmail.com

On Mon, 29 Oct 2012 20:02:11 +0530 Amit Kapila wrote:

On Sunday, October 28, 2012 12:28 AM Heikki Linnakangas wrote:

> One idea is to use the LZ format in the WAL record, but use your
> memcmp() code to construct it. I believe the slow part in LZ compression
> is in trying to locate matches in the "history", so if you just replace
> that with your code that's aware of the column boundaries and uses
> simple memcmp() to detect what parts changed, you could create LZ
> compressed output just as quickly as the custom encoded format. It would
> leave the door open for making the encoding smarter or to do actual
> compression in the future, without changing the format and the code to
> decode it.

This is a good idea. I shall try it.

In the existing algorithm, storing new data that is not present in
the history needs 1 control byte for every 8 bytes of new data,
which can increase the size of the compressed output as compared
to our delta encoding approach.

Approach-2
---------------
Use only one bit for control data [0 - Length and new data, 1 - pick from
history based on OFFSET-LENGTH]
The modified meaning of bit value 0 lets the new field data be handled as
one continuous stream of data, instead of treating every byte as separate new data.
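
A minimal sketch of the size arithmetic behind the two approaches. The function names are hypothetical, and the 1-byte length field (runs of at most 255 bytes) is an assumption of this sketch; the thread does not spell out the exact layout used by the patch. Control bits are rounded up to whole control bytes, as if the literal data were alone in the stream.

```c
#include <stddef.h>

/*
 * Bytes needed to store n literal bytes when every literal byte
 * consumes one control bit (8 control bits packed per control byte).
 */
static size_t
literal_cost_bit_per_byte(size_t n)
{
    return n + (n + 7) / 8;
}

/*
 * Bytes needed when a single control bit introduces a whole run:
 * one length byte followed by up to 255 literal bytes.
 * The 1-byte length field is an assumption of this sketch.
 */
static size_t
literal_cost_bit_per_run(size_t n)
{
    size_t      runs = (n + 254) / 255;     /* number of runs needed */

    return n + runs + (runs + 7) / 8;       /* data + lengths + control */
}
```

Under these assumptions a 100-byte changed column costs 113 bytes with one control bit per byte and 102 bytes with one control bit per run. For very short runs (8 bytes, say) the run form is not smaller, so the benefit depends on how long the changed stretches are.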

Attached are the patches

1. wal_update_changes_lz_v4 - to use LZ Approach with memcmp to construct WAL record

2. wal_update_changes_modified_lz_v5 - to use modified LZ Approach as mentioned above as Approach-2

The main changes as compared to the previous patch are as follows:

1. In heap_delta_encode, use LZ encoding instead of Custom encoding.

2. Instead of get_tup_info(), introduced heap_getattr_with_len() macro based on suggestion from Noah.

3. LZ macros moved from .c to .h, as they need to be used for encoding.

4. Changed the format for function arguments for heap_delta_encode()/heap_delta_decode() based on suggestion from Noah.

Performance Data:

Results:

Patch                      Threads=1        Threads=2        Threads=4        Threads=8
                           tps   WAL (GB)   tps   WAL (GB)   tps   WAL (GB)   tps   WAL (GB)
Xlog scale                  861   4.36      1463   7.33      2135  10.74      2689  13.56
Xlog scale + Original LZ    892   2.46      1685   3.35      3232   6.02      5296   9.20
Xlog scale + Modified LZ    852   2.35      1664   3.25      3229   5.71      5431   8.68

These are still WIP patches. Some cleanup has to be done.

Apart from that, I think the reason the performance is still not the same as with the Custom delta encoding approach is that the custom format has an IGN command: for all the unchanged data at the end it emits no commands, and decode is able to form the tuple using the old tuple.

I shall write wal_update_changes_custom_delta_v6, and then we can compare the performance data of all three patches and decide which one to go with based on the results.

Suggestions/Comments?

With Regards,

Amit Kapila.

Attachments:

wal_update_changes_lz_v4.patch (application/octet-stream, +826 -315)
wal_update_changes_modified_lz_v5.patch (application/octet-stream, +888 -369)
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#1)

On Thu, 8 Nov 2012 17:33:54 +0000 Amit Kapila wrote:

Please find the updated patches attached with this mail.

Modifications in these patches apart from the above:

1. Traverse the tuple only once (previously it had to be traversed 3 times) to check whether a particular offset matches and to get the offset used to generate the encoded tuple.
To achieve this I have changed the function heap_tuple_attr_equals() to heap_attr_get_length_and_check_equals(), so that it can also get the length of the tuple attribute, which can be used to calculate the offset. A separate function could also be written to achieve the same.

2. Improve the comments in the code.

Performance Data:

1. Please refer to the test case in the attached file pgbench_250.c.
Refer to the function used to create the random string at the end of this mail.

2. The detailed data and configuration settings can be found in the attached files (pgbench_encode_withlz_ff100 & pgbench_encode_withlz_ff80).

Benchmark results with -F 100:

Patch             tps@-c1  tps@-c2  tps@-c4  tps@-c8  WAL@-c8
xlogscale             802     1453     2253     2643  13.99 GB
xlogscale+org lz      807     1602     3168     5140   9.50 GB
xlogscale+mod lz      796     1620     3216     5270   9.16 GB

Benchmark results with -F 80:

Patch             tps@-c1  tps@-c2  tps@-c4  tps@-c8  WAL@-c8
xlogscale             811     1455     2148     2704  13.6 GB
xlogscale+org lz      829     1684     3223     5325   9.13 GB
xlogscale+mod lz      801     1657     3263     5488   8.86 GB

> I shall write the wal_update_changes_custom_delta_v6, and then we can compare all the three patches performance data and decide which one to go based on results.

The results with this are not better than the above 2 approaches, so I am not attaching it.

Function used to create random string
--------------------------------------------------------

CREATE OR REPLACE FUNCTION random_text_md5_v2(INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$
    SELECT upper(
        substring(
            (
                SELECT string_agg(md5(random()::TEXT), '')
                FROM generate_series(1, CEIL($1 / 32.)::integer)
            ),
            $1)
    );
$$;

Suggestions/Comments?

With Regards,

Amit Kapila.

Attachments:

wal_update_changes_lz_v4.patch (application/octet-stream, +897 -308)
wal_update_changes_mod_lz_v5.patch (application/octet-stream, +959 -362)
pgbench_250.c (text/plain)
pgbench_encode_withlz_ff100.htm (text/html)
pgbench_encode_withlz_ff80.htm (text/html)
#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#2)

Hello, I looked into the patch and have some comments.

Because of time constraints and the size of this patch,
please excuse that these comments cover only a part of it. Others
will follow in a few days.

==== heaptuple.c

nocachegetattr(_with_len):

- att_getlength has to do strlen in the worst case, or VARSIZE_ANY,
which is heavier than doing one comparison, so I recommend
adding 'if (len)' as the condition for doing this, and passing NULL
as &len to nocachegetattr_with_len in nocachegetattr.
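
The 'if (len)' suggestion is the usual optional out-parameter pattern. A minimal sketch, using hypothetical stand-in functions over NUL-terminated values rather than the real nocachegetattr code:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for an attribute fetch: returns a pointer to the
 * NUL-terminated value at data + off and, only when the caller asks for
 * it, also reports the value's length.
 */
static const char *
fetch_attr_with_len(const char *data, size_t off, size_t *len)
{
    const char *value = data + off;

    if (len != NULL)            /* skip the strlen() when nobody needs it */
        *len = strlen(value);
    return value;
}

/* Plain fetch: passes NULL, so no length is ever computed. */
static const char *
fetch_attr(const char *data, size_t off)
{
    return fetch_attr_with_len(data, off, NULL);
}
```

Callers that only want the value pay one pointer comparison instead of a length computation, which is the trade-off described above.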

heap_attr_get_length_and_check_equals:

- Size seems to be used conventionally as the type for memory
object lengths, so it might be better to use Size instead of
int32 as the type for *tup[12]_attr_len in the parameters.

- This function always returns false for attrnum <= 0 (whole
tuple or some system attribute comparisons) regardless of the real
result, which is a bit different from what the name leads one
to expect. If you need to keep this optimization, it
should have a name more specific to the purpose.

heap_delta_encode:

- Some misleading variable names (like match_not_found),
some repetitions of similar code snippets (att_align_pointer, pglz_out_tag),
misleadingly slight differences in the meanings of variables with
similar names (old_off and new_off and the similar pairs),
and a somewhat tricky use of pglz_out_add and pglz_out_tag with length = 0.

These would be welcome to be modified for better readability.

==== heapam.c

fastgetattr_with_len

- Missing left paren in line 867 ('nocachegetattr_with_len(tup)...')

- Missing enclosing paren in heapam.c:879 (len, style only)

- Allowing len = NULL would be good for performance, as in
nocachegetattr.

fastgetattr

- I suppose that the coding convention here is that the macro and
the alternative C code are expected to look similar. fastgetattr
looks quite different from the corresponding macro.

...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#3)

On Friday, December 07, 2012 2:28 PM Kyotaro HORIGUCHI wrote:

> Hello, I looked into the patch and have some comments.

Thank you for reviewing the patch.

> Because of time constraints and the size of this patch,
> please excuse that these comments cover only a part of it. Others
> will follow in a few days.

It's perfectly fine.

> ==== heaptuple.c
>
> nocachegetattr(_with_len):
>
> - att_getlength has to do strlen in the worst case, or VARSIZE_ANY,
> which is heavier than doing one comparison, so I recommend
> adding 'if (len)' as the condition for doing this, and passing NULL
> as &len to nocachegetattr_with_len in nocachegetattr.

Fixed.

> heap_attr_get_length_and_check_equals:
>
> - Size seems to be used conventionally as the type for memory
> object lengths, so it might be better to use Size instead of
> int32 as the type for *tup[12]_attr_len in the parameters.

Fixed.

> - This function always returns false for attrnum <= 0 (whole
> tuple or some system attribute comparisons) regardless of the real
> result, which is a bit different from what the name leads one
> to expect. If you need to keep this optimization, it
> should have a name more specific to the purpose.

The heap_attr_get_length_and_check_equals function is similar to heap_tuple_attr_equals;
the attrnum <= 0 check is required for heap_tuple_attr_equals.

> heap_delta_encode:
>
> - Some misleading variable names (like match_not_found),
> some repetitions of similar code snippets (att_align_pointer, pglz_out_tag),
> misleadingly slight differences in the meanings of variables with
> similar names (old_off and new_off and the similar pairs),
> and a somewhat tricky use of pglz_out_add and pglz_out_tag with length = 0.
>
> These would be welcome to be modified for better readability.

The variable names are modified, please check them once.

The (att_align_pointer, pglz_out_tag) repetition code is added to take care of padding only in case the values are equal.
Use of pglz_out_add and pglz_out_tag with length = 0 is done for code readability.

> ==== heapam.c
>
> fastgetattr_with_len
>
> - Missing left paren in line 867 ('nocachegetattr_with_len(tup)...')
>
> - Missing enclosing paren in heapam.c:879 (len, style only)
>
> - Allowing len = NULL would be good for performance, as in
> nocachegetattr.

Fixed, except len = NULL, because fastgetattr is modified as per the comment below.

> fastgetattr
>
> - I suppose that the coding convention here is that the macro and
> the alternative C code are expected to look similar. fastgetattr
> looks quite different from the corresponding macro.

Fixed.

Another change is also done to handle a history size of 2 bytes, which is possible with the usage of the LZ macros for delta encoding.

With Regards,
Amit Kapila.

Attachments:

wal_update_changes_lz_v5.patch (application/octet-stream, +938 -301)
wal_update_changes_mod_lz_v6.patch (application/octet-stream, +1008 -363)
#5Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#4)

Thank you.

>> heap_attr_get_length_and_check_equals:
>>
>> ..
>>
>> - This function always returns false for attrnum <= 0 (whole
>> tuple or some system attribute comparisons) regardless of the real
>> result, which is a bit different from what the name leads one
>> to expect. If you need to keep this optimization, it
>> should have a name more specific to the purpose.
>
> The heap_attr_get_length_and_check_equals function is similar to heap_tuple_attr_equals;
> the attrnum <= 0 check is required for heap_tuple_attr_equals.

Sorry, you're right.

>> heap_delta_encode:
>>
>> - Some misleading variable names (like match_not_found),
>> some repetitions of similar code snippets (att_align_pointer, pglz_out_tag),
>> misleadingly slight differences in the meanings of variables with
>> similar names (old_off and new_off and the similar pairs),
>> and a somewhat tricky use of pglz_out_add and pglz_out_tag with length = 0.
>>
>> These would be welcome to be modified for better readability.
>
> The variable names are modified, please check them once.
>
> The (att_align_pointer, pglz_out_tag) repetition code is added to take care of padding only in case the values are equal.
> Use of pglz_out_add and pglz_out_tag with length = 0 is done for code readability.

Oops! Sorry for the mistake. My point was about the bases for old_off
(of match_off) and dp, not new_off. It is not unnatural. Naming
had not been the problem and the function was fine as of the
last patch. I'd been confused by the asymmetry between match_off
passed to pglz_out_tag and dp passed to pglz_out_add.

> Another change is also done to handle a history size of 2 bytes, which is possible with the usage of the LZ macros for delta encoding.

Good catch. This seems to have been a potential bug which does no
harm when called from pglz_compress.

==========

Looking into wal_update_changes_mod_lz_v6.patch, I understand
that this patch experimentally adds literal data segments that
hold more than a single byte to the PG-LZ algorithm. According to
pglz_find_match, memcmp is slower than 'while(*s && *s == *d)' if
len < 16, and I suppose that is probably true at least for 4-byte
data. This also applies on the encoding side. If this mod
does no harm to performance, I would like to see it applied to
pglz_compress as well.

By the way, the comment at pg_lzcompress.c:690 seems to differ
quite a bit from what the code does.

regards,

*1: http://archives.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C38285495B0@szxeml509-mbx

--
Kyotaro Horiguchi
NTT Open Source Software Center


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#5)

On Monday, December 10, 2012 2:41 PM Kyotaro HORIGUCHI wrote:

> Thank you.
>
>>> heap_attr_get_length_and_check_equals:
>>>
>>> ..
>>>
>>> - This function always returns false for attrnum <= 0 (whole
>>> tuple or some system attribute comparisons) regardless of the real
>>> result, which is a bit different from what the name leads one
>>> to expect. If you need to keep this optimization, it
>>> should have a name more specific to the purpose.
>>
>> The heap_attr_get_length_and_check_equals function is similar to
>> heap_tuple_attr_equals; the attrnum <= 0 check is required for
>> heap_tuple_attr_equals.
>
> Sorry, you're right.
>
>>> heap_delta_encode:
>>>
>>> - Some misleading variable names (like match_not_found),
>>> some repetitions of similar code snippets (att_align_pointer, pglz_out_tag),
>>> misleadingly slight differences in the meanings of variables with
>>> similar names (old_off and new_off and the similar pairs),
>>> and a somewhat tricky use of pglz_out_add and pglz_out_tag with length = 0.
>>>
>>> These would be welcome to be modified for better readability.
>>
>> The variable names are modified, please check them once.
>>
>> The (att_align_pointer, pglz_out_tag) repetition code is added to take
>> care of padding only in case the values are equal.
>>
>> Use of pglz_out_add and pglz_out_tag with length = 0 is done for
>> code readability.
>
> Oops! Sorry for the mistake. My point was about the bases for old_off
> (of match_off) and dp, not new_off. It is not unnatural. Naming
> had not been the problem and the function was fine as of the
> last patch.

I think the new names I have chosen are more meaningful; do you think I should
revert to the ones from the previous patch?

> I'd been confused by the asymmetry between match_off
> passed to pglz_out_tag and dp passed to pglz_out_add.

If we look at the usage of pglz_out_tag and pglz_out_literal in pglz_compress(),
it is the same as what I have used.

>> Another change is also done to handle a history size of 2 bytes,
>> which is possible with the usage of the LZ macros for delta encoding.
>
> Good catch. This seems to have been a potential bug which does no
> harm when called from pglz_compress.
>
> ==========
>
> Looking into wal_update_changes_mod_lz_v6.patch, I understand
> that this patch experimentally adds literal data segments that
> hold more than a single byte to the PG-LZ algorithm. According to
> pglz_find_match, memcmp is slower than 'while(*s && *s == *d)' if
> len < 16, and I suppose that is probably true at least for 4-byte
> data. This also applies on the encoding side. If this mod
> does no harm to performance, I would like to see it applied to
> pglz_compress as well.

Where in pglz_compress() do you want to see similar usage?
Or do you want to see such use in the function
heap_attr_get_length_and_check_equals(), where it compares 2 attributes?

> By the way, the comment at pg_lzcompress.c:690 seems to differ
> quite a bit from what the code does.

I shall fix this.

With Regards,
Amit Kapila.


#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#6)

Hello, I took the performance figures for this patch.

CentOS6.3/Core i7
wal_level = archive, checkpoint_segments = 30 / 5min

A. Vanilla pgbench, postgres is HEAD
B. Vanilla pgbench, postgres is with this patch (wal_update_changes_lz_v5)
C. Modified pgbench(Long text), postgres is HEAD
D. Modified pgbench(Long text), postgres is with this patch

Ran pgbench -s 10 -i, then pgbench -c 20 -T 2400

      #trans/s                   WAL MB   WAL kB/tran
1A    437                          1723   1.68
1B    435 (<1% slower than A)      1645   1.61 (96% of A)
1C    149                          5073   14.6
1D    174 (17% faster than C)      5232   12.8 (88% of C)

Restoring with the WAL archives produced during the first test.

      Recv sec   ms/trans
2A          61   0.0581
2B          62   0.0594 (2% slower than A)
2C         287   0.805
2D         314   0.750 (7% faster than C)

For vanilla pgbench, WAL size shrinks slightly and performance
seems very slightly worse than unpatched postgres (1A vs. 1B). It
can safely be said that there is no harm to performance even outside
the effective range of this patch. On the other hand, the performance
gain becomes 17% within the effective range (1C vs. 1D).
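
The derived columns in the first table can be re-computed from the raw ones. A minimal sketch, assuming the number of transactions is roughly tps times the 2400-second run length given to pgbench -T (an inference from the figures, not something stated in the mail); the helper names are hypothetical.

```c
/*
 * WAL volume per transaction in kB, from the total WAL in MB,
 * the throughput in transactions/s and the run length in seconds.
 * Assumes total transactions is about tps * seconds.
 */
static double
wal_kb_per_tran(double wal_mb, double tps, double seconds)
{
    return wal_mb * 1024.0 / (tps * seconds);
}

/* Relative change of value against base, in percent. */
static double
percent_change(double base, double value)
{
    return (value - base) / base * 100.0;
}
```

With these, wal_kb_per_tran(1723, 437, 2400) is about 1.68 and wal_kb_per_tran(5232, 174, 2400) is about 12.8, matching rows 1A and 1D; row 1C comes out near 14.5 rather than the printed 14.6, presumably because the tps values are rounded. percent_change(149, 174) is about 16.8, the "17% faster" figure.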

Recovery performance shows the same tendency. There looks to be
a very small loss outside of the effective range (2A
vs. 2B) and a significant gain within it (2C vs. 2D).

As a whole, this patch brings a very large gain in its effective
range (e.g. updates of relatively small portions of tuples), and
only a negligible loss of performance is observed outside of its
effective range.

I'll mark this patch as 'Ready for Committer' as soon as I have
finished confirming the mod patch.

==========

> I think the new names I have chosen are more meaningful; do you think I should
> revert to the ones from the previous patch?

The new naming is more meaningful, if a bit long. I don't think it
should be reverted.

>> Looking into wal_update_changes_mod_lz_v6.patch, I understand
>> that this patch experimentally adds literal data segments that
>> hold more than a single byte to the PG-LZ algorithm. According to
>> pglz_find_match, memcmp is slower than 'while(*s && *s == *d)' if
>> len < 16, and I suppose that is probably true at least for 4-byte
>> data. This also applies on the encoding side. If this mod
>> does no harm to performance, I would like to see it applied to
>> pglz_compress as well.
>
> Where in pglz_compress() do you want to see similar usage?
> Or do you want to see such use in the function
> heap_attr_get_length_and_check_equals(), where it compares 2 attributes?

My point was the format for literal segments. It seems to reduce
literal segments by about an eighth. But its effectiveness in a
real environment does not look promising. Forget it. It was just a
fancy.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#7)

On Friday, December 14, 2012 2:32 PM Kyotaro HORIGUCHI wrote:

> I'll mark this patch as 'Ready for Committer' as soon as I get
> finished confirming the mod patch.

Thank you very much.

With Regards,
Amit Kapila.


#9Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#6)

Hello, I saw this patch and confirmed that

- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.

I also took the performance figures for 4 types of modification
versus 2 benchmarks.

I've seen a small performance gain (4-8% for execution, and 6-12% for
recovery) and a 16% WAL shrink with the modified pgbench, which
enhances the benefit of this patch.

On the other hand I've found no significant loss of performance
for execution and 4% reduction of WAL for original pgbench, but
there might be 4-8% performance loss for recovery.

Attached patches are listed below.

wal_update_changes_lz_v5.patch

A rather straightforward implementation of WAL compression using the
existing pg_lz compression format.

wal_update_changes_mod_lz_v6_2.patch

Modifies pg_lz to have a bulk literal segment format, which is
available only for WAL compression. A misplaced comment is fixed.

The performance details follow.
=====
I've run the tests including the mod patch and a 'modified' mod
patch.

CentOS6.3/Core i7
wal_level = archive, checkpoint_segments = 30 / 5min

wal_update_changes_mod_lz_v6+ is the version in which the memcpy for
segments shorter than 16 bytes is replaced by a copy loop,
while(*s) *d++=*s++.
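
A minimal sketch of that idea: copy short segments with a plain byte loop and fall back to memcpy for longer ones. The function name and the SHORT_COPY_LIMIT constant are hypothetical, and the loop here counts bytes rather than testing *s, since tuple data can contain zero bytes; that substitution is an assumption of this sketch, not a description of the patch.

```c
#include <stddef.h>
#include <string.h>

#define SHORT_COPY_LIMIT 16     /* assumed threshold, from the "16 bytes" above */

/* Copy len bytes from src to dst; the buffers must not overlap. */
static void
copy_segment(unsigned char *dst, const unsigned char *src, size_t len)
{
    if (len < SHORT_COPY_LIMIT)
    {
        /* short segment: a simple loop avoids the memcpy call overhead */
        while (len-- > 0)
            *dst++ = *src++;
    }
    else
        memcpy(dst, src, len);
}
```

Whether the loop actually wins depends on the compiler and platform, which is what the figures for case D and H below try to measure.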

    postgres                         pgbench
A.  HEAD                             Original
B.  wal_update_changes_lz_v5         Original
C.  wal_update_changes_mod_lz_v6     Original
D.  wal_update_changes_mod_lz_v6+    Original
E.  HEAD                             attached with this patch
F.  wal_update_changes_lz_v5         attached with this patch
G.  wal_update_changes_mod_lz_v6     attached with this patch
H.  wal_update_changes_mod_lz_v6+    attached with this patch

Ran pgbench -s 10 -i, then pgbench -c 10 -j 10 -T 1200

      #trans/s                   WAL MB   WAL kB/tran
1A    346                           760   1.87
1B    347                           730   1.80 (96% of A)
1C    346                           729   1.80 (96% of A)
1D    347                           730   1.80 (96% of A)

1E    192                          2790   6.20
1F    200 (4% faster than E)       2431   5.19 (84% of E)
1G    207 (8% faster than E)       2563   5.28 (85% of E)
1H    199 (4% faster than E)       2421   5.19 (84% of E)

Recovery time

      Recv sec   us/trans
2A          26   62.6
2B          27   64.8 (4% slower than A)
2C          28   67.4 (8% slower than A)
2D          26   62.4 (same as A)

2E         130   629
2F         149   579 ( 8% faster than E)
2G         128   592 ( 6% faster than E)
2H         130   553 (12% faster than E)

For vanilla pgbench, WAL size shrinks slightly and performance
seems the same as unpatched postgres (1A vs. 1B, 1C, 1D). For modified
pgbench, WAL size shrinks by about 17% and performance seems to
gain several percent.

Recovery performance shows the same tendency. There looks to be
a very small loss outside of the effective range (2A
vs. 2B, 2C) and a significant gain within it (2E vs. 2F, 2G, 2H).

As a whole, this patch brings a very large gain in its effective
range (e.g. updates of relatively small portions of a tuple), and
only a negligible loss of performance is observed outside of its
effective range on the test machine. I suppose the losses will be
more pronounced with WAL devices that have higher sequential
write performance.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

wal_update_changes_lz_v5.patch (text/x-patch; charset=us-ascii, +938 -301)
wal_update_changes_mod_lz_v6_2.patch (text/x-patch; charset=us-ascii, +874 -201)
#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Kyotaro Horiguchi (#9)

On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Hello, I saw this patch and confirmed that
>
> - Coding style looks good.
> - Appliable onto HEAD.
> - Some mis-codings are fixed.

I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.

My comments

* There is a fixed 75% heuristic in the patch. Can we document where
that came from? Can we have a parameter that sets that please? This
can be used to have further tests to confirm the useful setting of
this. I expect it to be removed before we release, but it will help
during beta.

* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way. Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?

* If full page writes is on and the page is very old, we are just
going to copy the whole block. So why not check for that rather than
do all these push ups and then just copy the page anyway?

* TOAST is not handled at all. No comments about it, nothing. Does
that mean it hasn't been considered? Or did we decide not to care in
this release? Presumably that means we are comparing toast pointers
byte by byte to see if they are the same?

* I'd like to see a specific test in regression that is designed to
exercise the code here. That way we will be certain that the code is
getting regularly tested.

* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.

* Lots of typos in comments. Many comments say nothing more than the
words already used in the function name itself

* "flags" variables are almost always int or uint in PG source.

* PGLZ_HISTORY_SIZE needs to be documented in the place it is defined,
not the place it's used. The test if (oldtuplen < PGLZ_HISTORY_SIZE)
really needs to be a test inside the compression module to maintain
better modularity, so the value itself needn't be exported

* Need to mention the WAL format change, or include the change within
the patch so we can see
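
The two cheap pre-checks suggested in the second comment above (skip the delta when the lengths differ widely; compare the whole tuple at once when the lengths are equal) could look roughly like the following sketch. Everything here is hypothetical: the names are invented, and "vary in length by > 50%" is read as "the difference exceeds half of the longer tuple", which is only one possible interpretation.

```c
#include <stddef.h>
#include <string.h>

typedef enum
{
    DELTA_SKIP,         /* lengths too different: log the new tuple as-is */
    DELTA_UNCHANGED,    /* same length and byte-identical */
    DELTA_TRY           /* worth running the per-column comparison */
} DeltaDecision;

static DeltaDecision
delta_precheck(const unsigned char *oldp, size_t oldlen,
               const unsigned char *newp, size_t newlen)
{
    size_t      larger = oldlen > newlen ? oldlen : newlen;
    size_t      smaller = oldlen > newlen ? newlen : oldlen;

    /* difference of more than 50% of the longer tuple: do not bother */
    if ((larger - smaller) * 2 > larger)
        return DELTA_SKIP;

    /* same length: one memcmp over the whole tuple before per-column work */
    if (oldlen == newlen && memcmp(oldp, newp, newlen) == 0)
        return DELTA_UNCHANGED;

    return DELTA_TRY;
}
```

The point of such a check is only to avoid per-column work in the cases where it is unlikely to pay off; the thresholds would need the kind of testing asked for in the first comment.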

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#10)

On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:

On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I saw this patch and confirmed that

- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.

I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.

My comments

* There is a fixed 75% heuristic in the patch. Can we document where
that came from?

It is from LZ compression strategy. Refer PGLZ_Strategy.
I will add comment for it.

Can we have a parameter that sets that please? This
can be used to have further tests to confirm the useful setting of
this. I expect it to be removed before we release, but it will help
during beta.

I shall add that for test purpose.

* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.

I think it makes more sense. So I shall update the patch.

Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?

I shall evaluate and discuss with you.

* If full page writes is on and the page is very old, we are just
going to copy the whole block. So why not check for that rather than
do all these push ups and then just copy the page anyway?

I shall check once and update the patch.

* TOAST is not handled at all. No comments about it, nothing. Does
that mean it hasn't been considered? Or did we decide not to care in
this release?

Presumably that means we are comparing toast pointers
byte by byte to see if they are the same?

Yes, currently this patch is doing byte by byte comparison for toast
pointers. I shall add comment.
In future, we can evaluate if further optimizations can be done.

* I'd like to see a specific test in regression that is designed to
exercise the code here. That way we will be certain that the code is
getting regularly tested.

I shall add more specific tests.

* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clealry documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.

Do you have any suggestion for where to put this information, any particular
ReadMe?

* Lots of typos in comments. Many comments say nothing more than the
words already used in the function name itself

* "flags" variables are almost always int or uint in PG source.

* PGLZ_HISTORY_SIZE needs to be documented in the place it is defined,
not the place it's used. The test if (oldtuplen < PGLZ_HISTORY_SIZE)
really needs to be a test inside the compression module to maintain
better modularity, so the value itself needn't be exported

I shall update the patch to address it.

* Need to mention the WAL format change, or include the change within
the patch so we can see

Sure, I will update this in code comments and internals docs.

With Regards,
Amit Kapila.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#9)

On Friday, December 28, 2012 1:38 PM Kyotaro HORIGUCHI wrote:

Hello, I saw this patch and confirmed that

- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.

And took the performance figures for 4 types of modification versus 2
benchmarks.

As a whole, this patch brings a very large gain within its effective range,
e.g. updates of relatively small portions of a tuple, while only a negligible
loss of performance is observed outside that range on the
test machine. I suppose the losses will become more visible with
faster sequential writes on the WAL device.

Thank you very much for the review.

With Regards,
Amit Kapila.

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)

On 28 December 2012 11:27, Amit Kapila <amit.kapila@huawei.com> wrote:

* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.

Do you have any suggestion for where to put this information, any particular
ReadMe?

Location is less relevant, since it will show up as additions in the patch.

Put it wherever makes most sense in comparison to existing related
comments/README. I have no preference myself.

If it's any consolation, I notice a common issue with patches is a lack
of *explanatory* comments, as opposed to line-by-line comments. I have made
the same review comment on 50-75% of the patches I've reviewed recently, which
is also likely why.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)

On 28 December 2012 11:27, Amit Kapila <amit.kapila@huawei.com> wrote:

* TOAST is not handled at all. No comments about it, nothing. Does
that mean it hasn't been considered? Or did we decide not to care in
this release?

Presumably that means we are comparing toast pointers
byte by byte to see if they are the same?

Yes, currently this patch does a byte-by-byte comparison of toast
pointers. I shall add a comment.
In future, we can evaluate whether further optimizations can be done.

Just a comment to say that the comparison takes place after TOASTed
columns have been removed. TOAST is already optimised for whole value
UPDATE anyway, so that is the right place to produce the delta.

It does make me think that we can further optimise TOAST by updating
only the parts of a toasted datum that have changed. That will be
useful for JSON and XML applications that change only a portion of
large documents.
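
As a rough illustration of the "update only the changed parts" idea, the sketch below compares two versions of a large value in fixed-size chunks and reports which chunks differ. The function name and the 2000-byte default chunk size are assumptions made for illustration, not values taken from the patch or from TOAST itself, and a fixed-chunk comparison only helps when bytes are changed in place rather than inserted or removed in the middle.

```python
def changed_chunks(old: bytes, new: bytes, chunk_size: int = 2000) -> list:
    """Return the indexes of fixed-size chunks whose contents differ.

    chunk_size is a hypothetical value chosen for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Number of chunks needed to cover the longer of the two values.
    n_chunks = (max(len(old), len(new)) + chunk_size - 1) // chunk_size
    changed = []
    for i in range(n_chunks):
        start, end = i * chunk_size, (i + 1) * chunk_size
        # Slicing past the end yields a shorter (or empty) chunk, so a
        # value that grew or shrank shows up as a changed tail chunk.
        if old[start:end] != new[start:end]:
            changed.append(i)
    return changed
```

Only the chunks listed in the result would need to be rewritten; everything else could be left untouched.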

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#10)

On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:

On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I saw this patch and confirmed that

- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.

I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.

* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.

Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?

If we have to do whole-tuple comparison, then the changed parts might
need to be stored byte by byte rather than at column offset boundaries.
This might not be possible with the current algorithm, as it stores the
information in WAL column by column and decodes it in the same way.
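
To make the column-by-column point concrete, here is a minimal sketch of a delta that is built and decoded at column boundaries. It is not the patch's code: the operation format and the function names are hypothetical, and it ignores tuple headers, null bitmaps and alignment padding.

```python
from typing import List, Tuple, Union

# An op is either ("copy", offset, length), referring to the old tuple,
# or ("literal", data), carrying new bytes verbatim.
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]


def encode_by_column(old_cols: List[bytes], new_cols: List[bytes]) -> List[Op]:
    """Compare tuples column by column and emit copy/literal operations."""
    ops: List[Op] = []
    old_off = 0  # byte offset of the current column inside the old tuple
    for i, new_col in enumerate(new_cols):
        old_col = old_cols[i] if i < len(old_cols) else None
        if new_col and old_col == new_col:
            # Unchanged column: reference its bytes in the old tuple,
            # extending the previous copy when the two are adjacent.
            if ops and ops[-1][0] == "copy" and ops[-1][1] + ops[-1][2] == old_off:
                ops[-1] = ("copy", ops[-1][1], ops[-1][2] + len(new_col))
            else:
                ops.append(("copy", old_off, len(new_col)))
        elif new_col:
            # Changed column: ship the new bytes as one continuous run.
            if ops and ops[-1][0] == "literal":
                ops[-1] = ("literal", ops[-1][1] + new_col)
            else:
                ops.append(("literal", new_col))
        if old_col is not None:
            old_off += len(old_col)
    return ops


def decode(old_tuple: bytes, ops: List[Op]) -> bytes:
    """Rebuild the new tuple from the old tuple plus the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, length = op
            out += old_tuple[off:off + length]
        else:
            out += op[1]
    return bytes(out)
```

Because every operation starts and ends on a column boundary, a single memcmp() over the whole tuple could only say "same" or "different"; it could not produce these operations without falling back to a byte-level scan.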

The internal docs are completely absent. We need at least a whole page of
descriptive comment, discussing trade-offs and design decisions.

Currently I have planned to put it transam/README, as most of WAL
description is present there.

With Regards,
Amit Kapila.

#16Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)

On 4 January 2013 13:53, Amit Kapila <amit.kapila@huawei.com> wrote:

On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:

On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I saw this patch and confirmed that

- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.

I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.

* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.

Also, if tuple is same length as
before, can we compare the whole tuple at once to save doing
per-column checks?

If we have to do whole-tuple comparison, then the changed parts might
need to be stored byte by byte rather than at column offset boundaries.
This might not be possible with the current algorithm, as it stores the
information in WAL column by column and decodes it in the same way.

OK, please explain in comments.

The internal docs are completely absent. We need at least a whole page of
descriptive comment, discussing trade-offs and design decisions.

Currently I have planned to put it transam/README, as most of WAL
description is present there.

But also in comments for each major function.

Thanks

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#16)

On Friday, January 04, 2013 8:03 PM Simon Riggs wrote:
On 4 January 2013 13:53, Amit Kapila <amit.kapila@huawei.com> wrote:

On Friday, December 28, 2012 3:52 PM Simon Riggs wrote:

On 28 December 2012 08:07, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, I saw this patch and confirmed that

- Coding style looks good.
- Appliable onto HEAD.
- Some mis-codings are fixed.

I've had a quick review of the patch to see how close we're getting.
The perf tests look to me like we're getting what we wanted from this
and I'm happy with the recovery performance trade-offs. Well done to
both author and testers.

The updated patch handles the comments below.

* There is a fixed 75% heuristic in the patch. Can we document where
that came from? Can we have a parameter that sets that please? This
can be used to have further tests to confirm the useful setting of
this. I expect it to be removed before we release, but it will help
during beta.

Added a GUC variable, wal_update_compression_ratio, to set the compression ratio.
It can be removed during beta.
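
A minimal sketch of how such a setting could gate the delta, modeled on the minimum-compression-rate test that pglz applies to its own output. The function name is hypothetical, and reading the setting as "minimum percentage saved" is an assumption; the patch may define the ratio the other way round.

```python
def delta_is_worthwhile(new_len: int, encoded_len: int, min_saving_pct: int) -> bool:
    """Return True when the encoded delta saves at least min_saving_pct percent.

    min_saving_pct stands in for a wal_update_compression_ratio-style
    setting, interpreted here (as an assumption) as the required saving.
    """
    if not 0 <= min_saving_pct <= 99:
        raise ValueError("min_saving_pct must be in 0..99")
    # Encoded output must stay below this size to meet the required saving.
    max_encoded = new_len * (100 - min_saving_pct) // 100
    return encoded_len < max_encoded
```

When the check fails, the new tuple would simply be logged as-is, so a stricter setting trades WAL savings for less time spent on deltas that barely help.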

* The compression algorithm depends completely upon new row length
savings. If the new row is short, it would seem easier to just skip
the checks and include it anyway. We can say if old and new vary in
length by > 50% of each other, just include new as-is, since the rows
very clearly differ in a big way.

Added a check in heap_delta_encode to identify whether the tuples differ in length by more than 50%.
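
The cheap pre-checks discussed in this thread can be sketched as follows. This is illustrative only: the function name is hypothetical, the history size of 4096 is assumed to match pglz's window, and "differ by more than 50%" is read here as "the length gap exceeds half of the longer tuple", which may not be how the patch measures it.

```python
PGLZ_HISTORY_SIZE = 4096  # assumed to match pglz's history window


def should_try_delta(old_len: int, new_len: int) -> bool:
    """Decide from tuple lengths alone whether a delta encode is worth trying."""
    # The old tuple serves as the history, so it must fit in the window.
    if old_len >= PGLZ_HISTORY_SIZE:
        return False
    longer = max(old_len, new_len)
    if longer == 0:
        return False
    # Skip the delta when the lengths differ by more than half of the
    # longer tuple: the rows clearly differ in a big way.
    return abs(old_len - new_len) * 2 <= longer
```

Both tests need only the two lengths, so they cost nothing compared with the per-column comparison they avoid.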

* If full page writes is on and the page is very old, we are just
going to copy the whole block. So why not check for that rather than
do all these push ups and then just copy the page anyway?

Added a function that identifies whether the page needs a backup block.
The optimization is applied based on the result.
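
A sketch of that decision, under the assumption that a page needs a backup block when full page writes are on and the page's LSN is at or before the redo pointer of the latest checkpoint. The function names and the integer LSNs are hypothetical simplifications, not the patch's interface.

```python
def needs_backup_block(page_lsn: int, redo_ptr: int, full_page_writes: bool) -> bool:
    """True when the next change to the page would log a full-page image."""
    # A page not modified since the checkpoint's redo point gets a full
    # image on its first change, provided full page writes are enabled.
    return full_page_writes and page_lsn <= redo_ptr


def use_delta_encoding(page_lsn: int, redo_ptr: int, full_page_writes: bool) -> bool:
    """Skip the delta when the whole block is copied into WAL anyway."""
    return not needs_backup_block(page_lsn, redo_ptr, full_page_writes)
```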

* I'd like to see a specific test in regression that is designed to
exercise the code here. That way we will be certain that the code is
getting regularly tested.

Added regression tests that cover all the changes made for the optimization, except recovery.

* The internal docs are completely absent. We need at least a whole
page of descriptive comment, discussing trade-offs and design
decisions. This is very important because it will help locate bugs
much faster if these things are clearly documented. It also helps
reviewers. This is a big timewaster for committers because you have to
read the whole patch and understand it before you can attempt to form
opinions. Commits happen quicker and easier with good comments.
* Need to mention the WAL format change, or include the change within
the patch so we can see

backend/access/transam/README is updated with details.

* Lots of typos in comments. Many comments say nothing more than the
words already used in the function name itself

Corrected the typos and removed unnecessary comments.

* "flags" variables are almost always int or uint in PG source.

* PGLZ_HISTORY_SIZE needs to be documented in the place it is defined,
not the place it's used. The test if (oldtuplen < PGLZ_HISTORY_SIZE)
really needs to be a test inside the compression module to maintain
better modularity, so the value itself needn't be exported

The (oldtuplen < PGLZ_HISTORY_SIZE) validation is moved inside heap_delta_encode,
and the flags variable is updated as well.

Test results with modified pgbench (1800 record size) on the latest patch:

-Patch-             -tps@-c1-   -WAL@-c1-   -tps@-c2-   -WAL@-c2-
Head                  831        4.17 GB     1416        7.13 GB
WAL modification      846        2.36 GB     1712        3.31 GB

-Patch-             -tps@-c4-   -WAL@-c4-   -tps@-c8-   -WAL@-c8-
Head                 2196       11.01 GB     2758       13.88 GB
WAL modification     3295        5.87 GB     5472        9.02 GB
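
The figures above can be turned into percentages with a few lines of arithmetic. The per-transaction figure assumes that the Head and patched runs lasted equally long, which the table does not state; the helper names are illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


def per_txn_reduction(wal_before, tps_before, wal_after, tps_after):
    """WAL-per-transaction reduction, assuming both runs lasted equally long."""
    return pct_reduction(wal_before / tps_before, wal_after / tps_after)


# (clients, Head tps, Head WAL in GB, patched tps, patched WAL in GB)
rows = [
    (1, 831, 4.17, 846, 2.36),
    (2, 1416, 7.13, 1712, 3.31),
    (4, 2196, 11.01, 3295, 5.87),
    (8, 2758, 13.88, 5472, 9.02),
]

for clients, h_tps, h_wal, p_tps, p_wal in rows:
    total = pct_reduction(h_wal, p_wal)
    per_txn = per_txn_reduction(h_wal, h_tps, p_wal, p_tps)
    print(f"-c{clients}: total WAL -{total:.1f}%, per-transaction WAL -{per_txn:.1f}%")
```

For example, total WAL drops by about 43.4% at one client and about 35.0% at eight clients. Because the patched run at eight clients also completes roughly twice as many transactions, the WAL written per transaction drops by about 67.2% under the equal-duration assumption.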

With Regards,
Amit Kapila.

Attachments:

wal_update_changes_v7.patch (application/octet-stream; +1295 -347)

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#17)

On 9 January 2013 08:05, Amit kapila <amit.kapila@huawei.com> wrote:

The updated patch handles the comments below.

Thanks

Test results with modified pgbench (1800 record size) on the latest patch:

-Patch-             -tps@-c1-   -WAL@-c1-   -tps@-c2-   -WAL@-c2-
Head                  831        4.17 GB     1416        7.13 GB
WAL modification      846        2.36 GB     1712        3.31 GB

-Patch-             -tps@-c4-   -WAL@-c4-   -tps@-c8-   -WAL@-c8-
Head                 2196       11.01 GB     2758       13.88 GB
WAL modification     3295        5.87 GB     5472        9.02 GB

And test results on normal pgbench?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#18)

On Wednesday, January 09, 2013 4:57 PM Simon Riggs wrote:

On 9 January 2013 08:05, Amit kapila <amit.kapila@huawei.com> wrote:

The updated patch handles the comments below.

Thanks

Test results with modified pgbench (1800 record size) on the latest patch:

-Patch-             -tps@-c1-   -WAL@-c1-   -tps@-c2-   -WAL@-c2-
Head                  831        4.17 GB     1416        7.13 GB
WAL modification      846        2.36 GB     1712        3.31 GB

-Patch-             -tps@-c4-   -WAL@-c4-   -tps@-c8-   -WAL@-c8-
Head                 2196       11.01 GB     2758       13.88 GB
WAL modification     3295        5.87 GB     5472        9.02 GB

And test results on normal pgbench?

As the performance readings showed no gain for the original pgbench, I
thought it was not mandatory.
However, I shall run the normal pgbench as well, to confirm that the patch
does not cause any dip there.
Thanks for pointing this out.

With Regards,
Amit Kapila.

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#18)

On Wednesday, January 09, 2013 4:57 PM Simon Riggs wrote:

On 9 January 2013 08:05, Amit kapila <amit.kapila@huawei.com> wrote:

The updated patch handles the comments below.

Thanks

Test results with modified pgbench (1800 record size) on the latest patch:

-Patch-             -tps@-c1-   -WAL@-c1-   -tps@-c2-   -WAL@-c2-
Head                  831        4.17 GB     1416        7.13 GB
WAL modification      846        2.36 GB     1712        3.31 GB

-Patch-             -tps@-c4-   -WAL@-c4-   -tps@-c8-   -WAL@-c8-
Head                 2196       11.01 GB     2758       13.88 GB
WAL modification     3295        5.87 GB     5472        9.02 GB

And test results on normal pgbench?

configuration:

shared_buffers = 4GB
wal_buffers = 16MB
checkpoint_segments = 256
checkpoint_interval = 15min
autovacuum = off
server_encoding = SQL_ASCII
client_encoding = UTF8
lc_collate = C
lc_ctype = C

init:

pgbench -s 75 -i -F 80

run:

pgbench -T 600

Test results with original pgbench (synccommit off) on the latest patch:

-Patch-             -tps@-c1-   -WAL@-c1-   -tps@-c2-   -WAL@-c2-
Head                 1459        1.40 GB     2491        1.70 GB
WAL modification     1558        1.38 GB     2441        1.59 GB

-Patch-             -tps@-c4-   -WAL@-c4-   -tps@-c8-   -WAL@-c8-
Head                 5139        2.49 GB    10651        4.72 GB
WAL modification     5224        2.28 GB    11329        3.96 GB

Test results with original pgbench (synccommit on) on the latest patch:

-Patch-             -tps@-c1-   -WAL@-c1-   -tps@-c2-   -WAL@-c2-
Head                  146        0.45 GB      167        0.49 GB
WAL modification      144        0.44 GB      166        0.49 GB

-Patch-             -tps@-c4-   -WAL@-c4-   -tps@-c8-   -WAL@-c8-
Head                  325        0.77 GB      603        1.03 GB
WAL modification      321        0.76 GB      604        1.01 GB

The results are similar to those noted by Kyotaro-san. The WAL size is reduced
even for the original pgbench.
There is a slight performance dip in some of the cases for the original pgbench.

With Regards,
Amit Kapila.

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)
#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#21)
#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)
#24Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#10)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#23)
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#24)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#24)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#27)
#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#4)
#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#29)
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#27)
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#30)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#31)
#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#32)
#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#33)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#34)
#37Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#36)
#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Amit Kapila (#37)
#39Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#29)
#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#38)
#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#39)