pglz compression performance, take two

Started by Andrey Borodin, over 5 years ago; 32 messages; pgsql-hackers
#1Andrey Borodin
amborodin@acm.org

Hi hackers!

A year ago Vladimir Leskov proposed a patch to speed up pglz compression [0]. PFA the patch with some editorialisation by me.
I saw some reports of pglz WAL compression being a bottleneck [1].

Hopefully soon we will have compression codecs developed by compression specialists; that work is going on in a nearby thread about custom compression methods.
Is it still viable to work on pglz optimisation? The patch makes it about 1.4x faster. Or should we rely on future use of lz4/zstd and others?

Best regards, Andrey Borodin.

[0]: /messages/by-id/169163A8-C96F-4DBE-A062-7D1CECBE9E5D@yandex-team.ru
[1]: https://smalldatum.blogspot.com/2020/12/tuning-for-insert-benchmark-postgres_4.html

Attachments:

0001-Reorganize-pglz-compression-code.patch (application/octet-stream) +278-183
#2Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#1)
Re: pglz compression performance, take two

On 9 Dec 2020, at 12:44, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
PFA the patch with some editorialisation by me.
I saw some reports of bottlenecking in pglz WAL compression [1].

I've checked that on my machine simple test
echo "wal_compression = on" >> $PGDATA/postgresql.conf
pgbench -i -s 20 && pgbench -T 30
shows a ~2-3% improvement, but the result is not very stable; the deviation is comparable to the gain. In fact, the bottleneck is just shifted elsewhere from pglz, so the impact is not that measurable.

I've found out that the patch continues ideas from thread [0] and commit 031cc55 [1], but in a much more shotgun-surgery way.
Out of curiosity I've rerun tests from that thread
postgres=# with patched as (select testname, avg(seconds) patched from testresults0 group by testname),unpatched as (select testname, avg(seconds) unpatched from testresults group by testname) select * from unpatched join patched using (testname);
     testname      |       unpatched        |        patched
-------------------+------------------------+------------------------
 512b random       |     4.5568015000000000 |     4.3512980000000000
 100k random       | 1.03342300000000000000 | 1.00326200000000000000
 100k of same byte |     2.1689715000000000 |     2.0958155000000000
 2k random         |     3.1613815000000000 |     3.1861350000000000
 512b text         |     5.7233600000000000 |     5.3602330000000000
 5k text           |     1.7044835000000000 |     1.8086770000000000
(6 rows)

Results of direct call are somewhat more clear.
Unpatched:
     testname      |   auto
-------------------+-----------
 5k text           |  1100.705
 512b text         |   240.585
 2k random         |   106.865
 100k random       |     2.663
 512b random       |   145.736
 100k of same byte | 13426.880
(6 rows)

Patched:
     testname      |   auto
-------------------+----------
 5k text           |  767.535
 512b text         |  159.076
 2k random         |   77.126
 100k random       |    1.698
 512b random       |   95.768
 100k of same byte | 6035.159
(6 rows)

Thanks!

Best regards, Andrey Borodin.

[0]: /messages/by-id/5130C914.8080106@vmware.com
[1]: https://github.com/x4m/postgres_g/commit/031cc55bbea6b3a6b67c700498a78fb1d4399476

#3Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#2)
Re: pglz compression performance, take two

On 12 Dec 2020, at 22:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I've cleaned up the comments, checked that the memory alignment stuff actually makes sense for 32-bit ARM (according to Godbolt), and did some more code cleanup. PFA v2 patch.

I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.

Best regards, Andrey Borodin.

Attachments:

v2-0001-Reorganize-pglz-compression-code.patch (application/octet-stream) +287-188
#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrey Borodin (#3)
Re: pglz compression performance, take two

On 12/26/20 8:06 AM, Andrey Borodin wrote:

On 12 Dec 2020, at 22:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I've cleaned up the comments, checked that the memory alignment stuff actually makes sense for 32-bit ARM (according to Godbolt), and did some more code cleanup. PFA v2 patch.

I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.

I'd suggest registering it, otherwise people are much less likely to
give you feedback. I don't see why it couldn't land in PG14.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#4)
Re: pglz compression performance, take two

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

On 12/26/20 8:06 AM, Andrey Borodin wrote:

I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.

I'd suggest registering it, otherwise people are much less likely to
give you feedback. I don't see why it couldn't land in PG14.

Even if lz4 or something else shows up, the existing code will remain
important for TOAST purposes. It would be years before we lose interest
in it.

regards, tom lane

#6Justin Pryzby
pryzby@telsasoft.com
In reply to: Andrey Borodin (#3)
Re: pglz compression performance, take two

On Sat, Dec 26, 2020 at 12:06:59PM +0500, Andrey Borodin wrote:

On 12 Dec 2020, at 22:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I've cleaned up the comments, checked that the memory alignment stuff actually makes sense for 32-bit ARM (according to Godbolt), and did some more code cleanup. PFA v2 patch.

I'm still in doubt whether I should register this patch on CF or not. I'm willing to work on this, but it's not clear whether it will hit PGv14. And I hope that for PGv15 we will have lz4 or something better for WAL compression.

Thanks for registering it.

There are some typos in the current patch:

farer (further: but it's not your typo)
positiion
reduce a => reduce the
monotonicity what => monotonicity, which
lesser good => less good
allign: align

This comment I couldn't understand:
+        * As initial compare for short matches compares 4 bytes then for the end
+        * of stream length of match should be cut

--
Justin

#7Andrey Borodin
amborodin@acm.org
In reply to: Justin Pryzby (#6)
Re: pglz compression performance, take two

Thanks for looking into this, Justin!

On 30 Dec 2020, at 09:39, Justin Pryzby <pryzby@telsasoft.com> wrote:

There's some typos in the current patch;

farer (further: but it's not your typo)
positiion
reduce a => reduce the
monotonicity what => monotonicity, which
lesser good => less good
allign: align

Fixed.

This comment I couldn't understand:
+        * As initial compare for short matches compares 4 bytes then for the end
+        * of stream length of match should be cut

I've reworded the comments.

Best regards, Andrey Borodin.

Attachments:

v3-0001-Reorganize-pglz-compression-code.patch (application/octet-stream) +289-188
#8Justin Pryzby
pryzby@telsasoft.com
In reply to: Andrey Borodin (#7)
Re: pglz compression performance, take two

@cfbot: rebased

Attachments:

0001-Reorganize-pglz-compression-code.patch (text/x-diff) +288-188
#9Andrey Borodin
amborodin@acm.org
In reply to: Justin Pryzby (#8)
Re: pglz compression performance, take two

On 22 Jan 2021, at 07:48, Justin Pryzby <pryzby@telsasoft.com> wrote:

@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>

Thanks!

I'm experimenting with TPC-C over PostgreSQL 13 on a production-like cluster in the cloud. Overall performance is IO-bound, but compression is burning a lot of energy too (according to perf top). The cluster consists of 3 nodes (HA only, no standby queries) with 32 vCPU each, 128GB RAM, sync replication, 2000 warehouses, and 240GB PGDATA.

Samples: 1M of event 'cpu-clock', 4000 Hz, Event count (approx.): 177958545079
Overhead  Shared Object    Symbol
  18.36%  postgres         [.] pglz_compress
   3.88%  [kernel]         [k] _raw_spin_unlock_irqrestore
   3.39%  postgres         [.] hash_search_with_hash_value
   3.00%  [kernel]         [k] finish_task_switch
   2.03%  [kernel]         [k] copy_user_enhanced_fast_string
   1.14%  [kernel]         [k] filemap_map_pages
   1.02%  postgres         [.] AllocSetAlloc
   0.93%  postgres         [.] _bt_compare
   0.89%  postgres         [.] PinBuffer
   0.82%  postgres         [.] SearchCatCache1
   0.79%  postgres         [.] LWLockAttemptLock
   0.78%  postgres         [.] GetSnapshotData

Overall the cluster runs 862 tps (52K tpmC, though only 26K tpmC is qualified on 2K warehouses).

Thanks!

Best regards, Andrey Borodin.

#10Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Justin Pryzby (#8)
Re: pglz compression performance, take two

On Jan 21, 2021, at 6:48 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:

@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>

Review comments.

First, I installed a build from master without this patch, created a test installation with lots of compressed text and array columns, upgraded the binaries to a build with this patch included, and tried to find problems with the data left over from the pre-patch binaries. Everything checks out. This is on little-endian mac osx intel core i9, not on any ARM platform that you are targeting with portions of the patch.

+/**************************************
+ *  CPU Feature Detection            *
+ **************************************/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The below switch allow to select different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violate C standard.
+ *         It can generate buggy code on targets which assembly generation depends on alignment.
+ *         But in some circumstances, it's the only known way to get the most performance (ie GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */

The link to blogspot.fr has a lot more information than your summary in the code comments. It might be hard to understand this comment if the blogspot article were ever to disappear. Perhaps you could include a bit more of the relevant details?

+#ifndef PGLZ_FORCE_MEMORY_ACCESS   /* can be defined externally */
+#if defined(__GNUC__) && \
+  ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+  || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif

I can understand wanting to set this on gcc + ARMv6, but doesn't this belong in a configure script rather than directly in the compression code?

The blogspot article indicates that the author lied about alignment to the compiler when using gcc on ARMv6, thereby generating a fast load instruction which happens to work on ARMv6. You appear to be using that same approach. Your #if defined(__GNUC__), seems to assume that all future versions of gcc will generate the instructions that you want, and not start generating some other set of instructions. Wouldn't you at least need a configure test to verify that the version of gcc being used generates the desired assembly? Even then, you'd be banking on gcc doing the same thing for the test code and for the pglz code, which I guess might not be true. Have you considered using inline assembly instead?
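For reference, the two access methods under discussion can be sketched as follows. This is an illustrative sketch, not the patch's code; the function names are mine:

```c
#include <stdint.h>
#include <string.h>

/*
 * Method 0: the safe, portable unaligned 32-bit load. memcpy() into a
 * local variable is well-defined for any address; modern GCC and Clang
 * fold it into a single load instruction on targets that permit
 * unaligned access, so no actual library call is made.
 */
static inline uint32_t
read32_memcpy(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/*
 * Method 1: the direct cast. For a misaligned pointer this is
 * undefined behavior under the C standard; it only pays off on
 * target/compiler pairs (the GCC + ARMv6 case above) where the safe
 * version compiles to byte-at-a-time loads.
 */
static inline uint32_t
read32_direct(const void *p)
{
	return *(const uint32_t *) p;
}
```

Whether a configure-time check can reliably predict which form the compiler optimizes is exactly the open question here.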


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Andrey Borodin (#9)
Re: pglz compression performance, take two

On Jan 28, 2021, at 2:56 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

22 янв. 2021 г., в 07:48, Justin Pryzby <pryzby@telsasoft.com> написал(а):

@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>

Thanks!

I'm experimenting with TPC-C over PostgreSQL 13 on a production-like cluster in the cloud. Overall performance is IO-bound, but compression is burning a lot of energy too (according to perf top). The cluster consists of 3 nodes (HA only, no standby queries) with 32 vCPU each, 128GB RAM, sync replication, 2000 warehouses, and 240GB PGDATA.

Samples: 1M of event 'cpu-clock', 4000 Hz, Event count (approx.): 177958545079
Overhead  Shared Object    Symbol
  18.36%  postgres         [.] pglz_compress
   3.88%  [kernel]         [k] _raw_spin_unlock_irqrestore
   3.39%  postgres         [.] hash_search_with_hash_value
   3.00%  [kernel]         [k] finish_task_switch
   2.03%  [kernel]         [k] copy_user_enhanced_fast_string
   1.14%  [kernel]         [k] filemap_map_pages
   1.02%  postgres         [.] AllocSetAlloc
   0.93%  postgres         [.] _bt_compare
   0.89%  postgres         [.] PinBuffer
   0.82%  postgres         [.] SearchCatCache1
   0.79%  postgres         [.] LWLockAttemptLock
   0.78%  postgres         [.] GetSnapshotData

Overall the cluster runs 862 tps (52K tpmC, though only 26K tpmC is qualified on 2K warehouses).

Thanks!

Robert Haas just committed Dilip Kumar's LZ4 compression, bbe0a81db69bd10bd166907c3701492a29aca294.

Is this pglz compression patch still relevant? How does the LZ4 compression compare on your hardware?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Justin Pryzby
pryzby@telsasoft.com
In reply to: Mark Dilger (#11)
Re: pglz compression performance, take two

On Fri, Mar 19, 2021 at 01:29:14PM -0700, Mark Dilger wrote:

Robert Haas just committed Dilip Kumar's LZ4 compression, bbe0a81db69bd10bd166907c3701492a29aca294.

Is this pglz compression patch still relevant? How does the LZ4 compression compare on your hardware?

I think it's still relevant, since many people may not end up with binaries
--with-lz4 (I'm thinking of cloud providers). PGLZ is what existing data uses,
and people may not want to/know to migrate to shiny new features, but they'd
like it if their queries were 20% faster after upgrading without needing to.

Also, Dilip's patch is only for TOAST compression, and pglz is also being used
for wal_compression - Andrey has a short patch to implement lz4 for that:
https://commitfest.postgresql.org/32/3015/

--
Justin

#13Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#12)
Re: pglz compression performance, take two

On Sat, Mar 20, 2021 at 12:19:45AM -0500, Justin Pryzby wrote:

I think it's still relevant, since many people may not end up with binaries
--with-lz4 (I'm thinking of cloud providers). PGLZ is what existing data uses,
and people may not want to/know to migrate to shiny new features, but they'd
like it if their queries were 20% faster after upgrading without needing to.

Yeah, I agree that local improvements here are relevant, particularly
as we don't enforce the rewrite of toast data already compressed with
pglz. So, we still need to stick with pglz for some time.
--
Michael

#14Andrey Borodin
amborodin@acm.org
In reply to: Mark Dilger (#10)
Re: pglz compression performance, take two

On 20 Mar 2021, at 00:35, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Jan 21, 2021, at 6:48 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:

@cfbot: rebased
<0001-Reorganize-pglz-compression-code.patch>

Review comments.

Thanks for the review, Mark!
And sorry for such a long delay; I've been trying to figure out a way to make things less platform-dependent.
And here's what I've come up with.

We use pglz_read32() differently from how xxhash and lz4 do: we do not really need to get a 4-byte value, we only need to compare 4 bytes at once.
So, essentially, we need to compare two implementations of a 4-byte comparison:

bool
cmp_a(const void *ptr1, const void *ptr2)
{
return *(const uint32_t *) ptr1 == *(const uint32_t *) ptr2;
}

bool
cmp_b(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}

Variant B is more portable. Inspecting it in Godbolt's Compiler Explorer, I've found out that for GCC 7.1+ it generates assembly without a memcmp() call. For x86-64 and ARM64, the assembly of cmp_b is identical to cmp_a.
So I think maybe we could just stick with version cmp_b instead of optimising for ARMv6 and similar small platforms like Arduino.

I've benchmarked the patch with "REINDEX TABLE pgbench_accounts" on a pgbench -i database of scale 100. wal_compression was on; other settings were default.
Without the patch it takes ~11055.077 ms on my machine; with the patch it takes ~9512.411 ms, a 14% speedup overall.

PFA v5.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v5-0001-Reorganize-pglz-compression-code.patch (application/octet-stream) +249-188
#15Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Andrey Borodin (#14)
Re: pglz compression performance, take two

On Jun 27, 2021, at 3:41 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

And here's what I've come up with.

I have not tested the patch yet, but here are some quick review comments:

#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */

...

static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];

...

if (hist_next == PGLZ_HISTORY_SIZE + 1)

These are the only uses of PGLZ_HISTORY_SIZE. Perhaps you could just define the symbol as 0x0fff and skip the -1 and +1 business?
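A side note on that definition: without parentheses, the `0x0fff - 1` expansion also interacts with whatever operators surround the macro use. A tiny illustration (the `_BAD`/`_GOOD` names are hypothetical, not from the patch):

```c
/* As written in the patch: no parentheses around the arithmetic */
#define HISTORY_SIZE_BAD  0x0fff - 1

/* Conventionally parenthesized form */
#define HISTORY_SIZE_GOOD (0x0fff - 1)

/*
 * 2 * HISTORY_SIZE_BAD expands to 2 * 0x0fff - 1 = 8189, because '*'
 * binds tighter than '-', while 2 * HISTORY_SIZE_GOOD is
 * 2 * (0x0fff - 1) = 8188. The values only agree when the macro is
 * used on its own.
 */
```

Defining the symbol as a bare 0x0fff, as suggested above, would sidestep the hazard entirely.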

/* ----------
* pglz_compare -
*
* Compares 4 bytes at pointers
* ----------
*/
static inline bool
pglz_compare32(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}

The comment function name differs from the actual function name.

Also, pglz_compare returns an offset into the string, whereas pglz_compare32 returns a boolean. This is fairly unintuitive. The "32" part of pglz_compare32 implies doing the same thing as pglz_compare but where the string is known to be 4 bytes in length. Given that pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using /pglz_compare/ in its name?

/*
* Determine length of match. A better match must be larger than the
* best so far. And if we already have a match of 16 or more bytes,
* it's worth the call overhead to use memcmp()

This comment is hard to understand, given the code that follows. The first block calls memcmp(), which seems to be the function overhead you refer to. The second block calls the static inline function pglz_compare32, which internally calls memcmp(). Superficially, there seems to be a memcmp() function call either way. The difference is that in the first block's call to memcmp(), the length is a runtime value, and in the second block, it is a compile-time known value. If you are depending on the compiler to notice this distinction and optimize the second call, perhaps you can mention that explicitly? Otherwise, reading and understanding the comment takes more effort.
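The distinction could be spelled out roughly like this. A hedged sketch with a hypothetical helper name, not the patch's actual code; the point is that a compile-time-constant length lets the compiler inline the compare:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative two-tier candidate check, as described in the comment
 * under review. The names and structure here are mine.
 */
static bool
candidate_worth_scanning(const unsigned char *input,
						 const unsigned char *hist,
						 int best_len)
{
	if (best_len >= 16)
	{
		/*
		 * Runtime-length compare: a real memcmp() call, but worth the
		 * call overhead because a long best-so-far match lets a single
		 * call reject a candidate outright.
		 */
		return memcmp(input, hist, best_len) == 0;
	}

	/*
	 * Constant-length compare: the length 4 is known at compile time,
	 * so GCC/Clang typically expand this inline to one 32-bit
	 * load-and-compare with no library call.
	 */
	return memcmp(input, hist, 4) == 0;
}
```

If this is indeed the intended optimization, saying so explicitly in the comment would spare the reader the same puzzlement.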

I took a quick look for other places in the code that try to beat the performance of memcmp on short strings. In varlena.c, rest_of_char_same() seems to do so. We also use comparisons on NameData, which frequently contains strings shorter than 16 bytes. Is it worth sharting a static inline function that uses your optimization in other places? How confident are you that your optimization really helps?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#15)
Re: pglz compression performance, take two

On Jun 28, 2021, at 9:05 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Is it worth sharting a static inline function that uses your optimization in other places?

s/sharting/sharing/

How confident are you that your optimization really helps?

By which I mean, is the optimization worth the extra branch checking if (len >= 16)? Is the 14% speedup you report dependent on this extra complexity?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Mark Dilger (#16)
Re: pglz compression performance, take two

Hi,

I've looked at this patch again and did some testing. I don't have any
comments to the code (I see there are two comments from Mark after the
last version, though).

For the testing, I did a fairly simple benchmark loading either random
or compressible data into a bytea column. The tables are defined as
unlogged, the values are 1kB, 4kB and 1MB, and the total amount of data
is always 1GB. The timings are

    test          master   patched   delta
 ------------------------------------------
 random_1k         12295     12312    100%
 random_1m         12999     12984    100%
 random_4k         16881     15959     95%
 redundant_1k      12308     12348    100%
 redundant_1m      16632     14072     85%
 redundant_4k      16798     13828     82%

I ran the test on multiple x86_64 machines, but the behavior is almost
exactly the same.

This shows there's no difference for 1kB (expected, because this does
not exceed the ~2kB TOAST threshold). For random data in general the
difference is pretty negligible, although it's a bit strange that the
4kB values take longer than the 1MB ones.

For redundant (highly compressible) values, there's quite a significant
speedup of 15-18%. Real-world data is likely somewhere in between, but
the speedup is still pretty nice.

Andrey, can you update the patch per Mark's review? I'll do my best to
get it committed sometime in this CF.

Attached are the two scripts used for generating / testing (you'll have
to fix some hardcoded paths, but they're simple otherwise).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

generator.sql (application/sql)
test.sh (application/x-shellscript)
#18Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#17)
Re: pglz compression performance, take two

Thanks for the review, Mark! Sorry it took so long to reply on my side.

On 28 Jun 2021, at 21:05, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */

...

static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];

...

if (hist_next == PGLZ_HISTORY_SIZE + 1)

These are the only uses of PGLZ_HISTORY_SIZE. Perhaps you could just define the symbol as 0x0fff and skip the -1 and +1 business?

Fixed.

/* ----------
* pglz_compare -
*
* Compares 4 bytes at pointers
* ----------
*/
static inline bool
pglz_compare32(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}

The comment function name differs from the actual function name.

Also, pglz_compare returns an offset into the string, whereas pglz_compare32 returns a boolean. This is fairly unintuitive. The "32" part of pglz_compare32 implies doing the same thing as pglz_compare but where the string is known to be 4 bytes in length. Given that pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using /pglz_compare/ in its name?

I've removed pglz_compare32 entirely. It was a simple substitution for memcmp().

/*
* Determine length of match. A better match must be larger than the
* best so far. And if we already have a match of 16 or more bytes,
* it's worth the call overhead to use memcmp()

This comment is hard to understand, given the code that follows. The first block calls memcmp(), which seems to be the function overhead you refer to. The second block calls the static inline function pglz_compare32, which internally calls memcmp(). Superficially, there seems to be a memcmp() function call either way. The difference is that in the first block's call to memcmp(), the length is a runtime value, and in the second block, it is a compile-time known value. If you are depending on the compiler to notice this distinction and optimize the second call, perhaps you can mention that explicitly? Otherwise, reading and understanding the comment takes more effort.

I've updated the comment for the second branch with the fixed-size memcmp(). Frankly, I'm not sure "if (memcmp(input_pos, hist_pos, 4) == 0)" is worth the complexity; the internals of "pglz_compare(0, len_bound, input_pos + 0, hist_pos + 0);" would execute almost the same instructions.

I took a quick look for other places in the code that try to beat the performance of memcmp on short strings. In varlena.c, rest_of_char_same() seems to do so. We also use comparisons on NameData, which frequently contains strings shorter than 16 bytes. Is it worth sharting a static inline function that uses your optimization in other places? How confident are you that your optimization really helps?

Honestly, I do not know. The overall patch effect consists of stacking up many small optimizations. They have a net effect, but each is too noisy to measure independently. That's mostly the reason why I didn't know what to reply for so long.

On 5 Nov 2021, at 01:47, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Andrey, can you update the patch per Mark's review? I'll do my best to get it committed sometime in this CF.

Cool! Here's the patch.

Best regards, Andrey Borodin.

Attachments:

v6-0001-Reorganize-pglz-compression-code.patch (application/octet-stream) +242-191
#19Ian Lawrence Barwick
barwick@gmail.com
In reply to: Andrey Borodin (#18)
Re: pglz compression performance, take two

On Fri, 5 Nov 2021 at 14:51, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Thanks for the review, Mark! Sorry it took so long to reply on my side.

On 28 Jun 2021, at 21:05, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

#define PGLZ_HISTORY_SIZE 0x0fff - 1 /* to avoid compare in iteration */

...

static PGLZ_HistEntry hist_entries[PGLZ_HISTORY_SIZE + 1];

...

if (hist_next == PGLZ_HISTORY_SIZE + 1)

These are the only uses of PGLZ_HISTORY_SIZE. Perhaps you could just define the symbol as 0x0fff and skip the -1 and +1 business?

Fixed.

/* ----------
* pglz_compare -
*
* Compares 4 bytes at pointers
* ----------
*/
static inline bool
pglz_compare32(const void *ptr1, const void *ptr2)
{
return memcmp(ptr1, ptr2, 4) == 0;
}

The comment function name differs from the actual function name.

Also, pglz_compare returns an offset into the string, whereas pglz_compare32 returns a boolean. This is fairly unintuitive. The "32" part of pglz_compare32 implies doing the same thing as pglz_compare but where the string is known to be 4 bytes in length. Given that pglz_compare32 is dissimilar to pglz_compare, perhaps avoid using /pglz_compare/ in its name?

I've removed pglz_compare32 entirely. It was a simple substitution for memcmp().

/*
* Determine length of match. A better match must be larger than the
* best so far. And if we already have a match of 16 or more bytes,
* it's worth the call overhead to use memcmp()

This comment is hard to understand, given the code that follows. The first block calls memcmp(), which seems to be the function overhead you refer to. The second block calls the static inline function pglz_compare32, which internally calls memcmp(). Superficially, there seems to be a memcmp() function call either way. The difference is that in the first block's call to memcmp(), the length is a runtime value, and in the second block, it is a compile-time known value. If you are depending on the compiler to notice this distinction and optimize the second call, perhaps you can mention that explicitly? Otherwise, reading and understanding the comment takes more effort.

I've updated the comment for the second branch with the fixed-size memcmp(). Frankly, I'm not sure "if (memcmp(input_pos, hist_pos, 4) == 0)" is worth the complexity; the internals of "pglz_compare(0, len_bound, input_pos + 0, hist_pos + 0);" would execute almost the same instructions.

I took a quick look for other places in the code that try to beat the performance of memcmp on short strings. In varlena.c, rest_of_char_same() seems to do so. We also use comparisons on NameData, which frequently contains strings shorter than 16 bytes. Is it worth sharting a static inline function that uses your optimization in other places? How confident are you that your optimization really helps?

Honestly, I do not know. The overall patch effect consists of stacking up many small optimizations. They have a net effect, but each is too noisy to measure independently. That's mostly the reason why I didn't know what to reply for so long.

On 5 Nov 2021, at 01:47, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Andrey, can you update the patch per Mark's review? I'll do my best to get it committed sometime in this CF.

Cool! Here's the patch.

HI!

This patch is marked as "Ready for Committer" in the current commitfest [1],
but has seen no further activity for more than a year. Given that it's on
its 10th commitfest, it would be useful to clarify its status one way or
the other.

[1]: https://commitfest.postgresql.org/40/2897/

Thanks

Ian Barwick

#20Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ian Lawrence Barwick (#19)
Re: pglz compression performance, take two

Hi,

I took a look at the v6 patch, with the intention to get it committed. I
have a couple minor comments:

1) For PGLZ_HISTORY_SIZE it uses literal 0x0fff, with the explanation:

/* to avoid compare in iteration */

which I think indicates the intent to use this value as a bit mask, but then the
only place using PGLZ_HISTORY_SIZE does

if (hist_next == PGLZ_HISTORY_SIZE) ...

i.e. a comparison. Maybe I misunderstand the comment, though.

2) PGLZ_HistEntry was modified to replace links (pointers) with
indexes, but the comments still talk about "links", so maybe that needs
to be changed. Also, I wonder why next_id is int16 while hist_idx is
uint16 (and also why "id" vs. "idx")?

3) minor formatting of comments

4) the comment in pglz_find_match about traversing the history seems to
come too early: it's placed before the handling of invalid entries and
the cleanup, which it does not mention at all, while the actual while
loop comes after that.
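Regarding point 1, the usual way a power-of-two history table avoids the compare is to advance the ring index with a bit mask. A generic sketch of that idiom, not the patch's code (which may have reasons, such as a reserved invalid slot, to keep the compare):

```c
#include <stdint.h>

#define HIST_SIZE 0x1000			/* power of two: 4096 slots */
#define HIST_MASK (HIST_SIZE - 1)	/* 0x0fff */

/*
 * Advance a ring-buffer index without a compare-and-reset branch:
 * the AND with the mask wraps 4095 back to 0 in one instruction.
 */
static inline uint16_t
hist_advance(uint16_t idx)
{
	return (idx + 1) & HIST_MASK;
}
```

If the patch keeps the `if (hist_next == ...)` comparison instead, the "to avoid compare in iteration" comment probably deserves rewording.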

Attached is v6 in 0001 (verbatim), with the review comments in 0002.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Reorganize-pglz-compression-code-v6.patch (text/x-patch) +242-191
0002-review-v6.patch (text/x-patch) +8-7
#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#20)
#22Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#20)
#23Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#22)
#24Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrey Borodin (#24)
#26Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#25)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrey Borodin (#26)
#28Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#24)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#29)
#31Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#30)
#32Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#31)