pglz performance

Started by Andrey Borodin · almost 7 years ago · 69 messages · pgsql-hackers
#1 Andrey Borodin
amborodin@acm.org

Hi hackers!

I was reviewing Paul Ramsey's TOAST patch[0] and noticed that there is big room for improvement in the performance of pglz compression and decompression.

With Vladimir we started to investigate ways to boost byte copying and eventually created a test suite[1] to investigate the performance of compression and decompression.
This is an extension with a single function, test_pglz(), which performs tests for different:
1. Data payloads
2. Compression implementations
3. Decompression implementations

Currently we test mostly decompression improvements against two WALs and one data file taken from a pgbench-generated database. Any suggestions on more relevant data payloads are very welcome.
My laptop tests show that our decompression implementation [2] can be from 15% to 50% faster.
Also I've noted that compression is extremely slow, ~30 times slower than decompression. I believe we can do something about it.

We focus only on boosting the existing codec, without considering other compression algorithms.

Any comments are much appreciated.

Most important questions are:
1. What are relevant data sets?
2. What are relevant CPUs? I have only Xeon-based servers and a few laptops/desktops with Intel CPUs
3. If compression is 30 times slower, should we rather focus on compression instead of decompression?

Best regards, Andrey Borodin.

[0]: /messages/by-id/CANP8+jKcGj-JYzEawS+CUZnfeGKq4T5LswcswMP4GUHeZEP1ag@mail.gmail.com
[1]: https://github.com/x4m/test_pglz
[2]: /messages/by-id/C2D8E5D5-3E83-469B-8751-1C7877C2A5F2@yandex-team.ru

#2 Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#1)
Re: pglz performance

On Mon, May 13, 2019 at 07:45:59AM +0500, Andrey Borodin wrote:

I was reviewing Paul Ramsey's TOAST patch[0] and noticed that there
is big room for improvement in the performance of pglz compression
and decompression.

Yes, I believe so too. pglz is a huge CPU consumer when it comes to
compression compared to more modern algos like lz4.

With Vladimir we started to investigate ways to boost byte copying
and eventually created a test suite[1] to investigate the performance
of compression and decompression. This is an extension with a single
function, test_pglz(), which performs tests for different:
1. Data payloads
2. Compression implementations
3. Decompression implementations

Cool. I got something rather similar in my wallet of plugins:
https://github.com/michaelpq/pg_plugins/tree/master/compress_test
This is something I worked on mainly for FPW compression in WAL.

Currently we test mostly decompression improvements against two WALs
and one data file taken from a pgbench-generated database. Any
suggestions on more relevant data payloads are very welcome.

Text strings made of random data and variable length? For any test of
this kind I think that it is good to focus on the performance of the
low-level calls, even going as far as a simple C wrapper on top of the
pglz APIs to test only the performance and not have extra PG-related
overhead like palloc(), which can be a barrier. Focusing on strings
with lengths from 1kB up to 16kB may be a good range of sizes, and it
is important to keep the same uncompressed strings for performance comparison.

My laptop tests show that our decompression implementation [2] can
be from 15% to 50% faster. Also I've noted that compression is
extremely slow, ~30 times slower than decompression. I believe we
can do something about it.

That's nice.

We focus only on boosting the existing codec, without considering
other compression algorithms.

There is this as well. A couple of algorithms have a license
compatible with Postgres, but it may be simpler to just improve
pglz. A 10%~20% improvement is something worth doing.

Most important questions are:
1. What are relevant data sets?
2. What are relevant CPUs? I have only Xeon-based servers and a few
laptops/desktops with Intel CPUs
3. If compression is 30 times slower, should we rather focus on
compression instead of decompression?

Decompression can matter a lot for mostly-read workloads and
compression can become a bottleneck for heavy-insert loads, so
improving compression or decompression should be two separate
problems, not two problems linked. Any improvement in one or the
other, or even both, is nice to have.
--
Michael

#3 Andrey Borodin
amborodin@acm.org
In reply to: Michael Paquier (#2)
Re: pglz performance

On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:

Currently we test mostly decompression improvements against two WALs
and one data file taken from pgbench-generated database. Any
suggestion on more relevant data payloads are very welcome.

Text strings made of random data and variable length?

Like a text corpus?

For any test of
this kind I think that it is good to focus on the performance of the
low-level calls, even going as far as a simple C wrapper on top of the
pglz APIs to test only the performance and not have extra PG-related
overhead like palloc() which can be a barrier.

Our test_pglz extension measures only the time of the actual compression, does a warmup run, and performs all allocations before measurement.

Focusing on strings of
lengths of 1kB up to 16kB may be an idea of size, and it is important
to keep the same uncompressed strings for performance comparison.

We intentionally avoid using generated data and thus keep the test files committed in the git repo.
We also check that the decompressed data matches the compression input. All tests are run 5 times.

We use a PG extension only for simplicity of deploying the benchmarks to our PG clusters.

Here are some test results.

Currently we test on 4 payloads:
1. WAL from cluster initialization
2. 2 WALs from pgbench -i -s 10
3. a data file taken from pgbench -i -s 10

We use these decompressors:
1. pglz_decompress_vanilla - taken from PG source code
2. pglz_decompress_hacked - uses sliced memcpy to imitate byte-by-byte pglz decompression
3. pglz_decompress_hacked4, pglz_decompress_hacked8, pglz_decompress_hackedX - use memcpy only if the match is no shorter than X bytes. We need to determine the best X if this approach is used.

I used three platforms:
1. Server: 2× Intel Xeon E5-2660 (Supermicro SYS-1027R-N3RF, 16× DDR3 ECC REG, 10× SAS 2.5"), Ubuntu 14, PG 9.6
2. Desktop: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, Ubuntu 18, PG 12devel
3. Laptop: MacBook Pro 15" 2015, 2.2 GHz Core i7 (i7-4770HQ), macOS, PG 12devel
Owners of AMD and ARM devices are welcome.

Server results (less is better):
NOTICE: 00000: Time to decompress one byte in ns:
NOTICE: 00000: Payload 000000010000000000000001
NOTICE: 00000: Decompressor pglz_decompress_hacked result 0.647235
NOTICE: 00000: Decompressor pglz_decompress_hacked4 result 0.671029
NOTICE: 00000: Decompressor pglz_decompress_hacked8 result 0.699949
NOTICE: 00000: Decompressor pglz_decompress_hacked16 result 0.739586
NOTICE: 00000: Decompressor pglz_decompress_hacked32 result 0.787926
NOTICE: 00000: Decompressor pglz_decompress_vanilla result 1.147282
NOTICE: 00000: Payload 000000010000000000000006
NOTICE: 00000: Decompressor pglz_decompress_hacked result 0.201774
NOTICE: 00000: Decompressor pglz_decompress_hacked4 result 0.211859
NOTICE: 00000: Decompressor pglz_decompress_hacked8 result 0.212610
NOTICE: 00000: Decompressor pglz_decompress_hacked16 result 0.214601
NOTICE: 00000: Decompressor pglz_decompress_hacked32 result 0.221813
NOTICE: 00000: Decompressor pglz_decompress_vanilla result 0.706005
NOTICE: 00000: Payload 000000010000000000000008
NOTICE: 00000: Decompressor pglz_decompress_hacked result 1.370132
NOTICE: 00000: Decompressor pglz_decompress_hacked4 result 1.388991
NOTICE: 00000: Decompressor pglz_decompress_hacked8 result 1.388502
NOTICE: 00000: Decompressor pglz_decompress_hacked16 result 1.529455
NOTICE: 00000: Decompressor pglz_decompress_hacked32 result 1.520813
NOTICE: 00000: Decompressor pglz_decompress_vanilla result 1.433527
NOTICE: 00000: Payload 16398
NOTICE: 00000: Decompressor pglz_decompress_hacked result 0.606943
NOTICE: 00000: Decompressor pglz_decompress_hacked4 result 0.623044
NOTICE: 00000: Decompressor pglz_decompress_hacked8 result 0.624118
NOTICE: 00000: Decompressor pglz_decompress_hacked16 result 0.620987
NOTICE: 00000: Decompressor pglz_decompress_hacked32 result 0.621183
NOTICE: 00000: Decompressor pglz_decompress_vanilla result 1.365318

Comment: pglz_decompress_hacked is unconditionally optimal. In most cases it is 2x faster than the current implementation.
On 000000010000000000000008 it is only marginally better. pglz_decompress_hacked8 is a few percent worse than pglz_decompress_hacked.

Desktop results:
NOTICE: Time to decompress one byte in ns:
NOTICE: Payload 000000010000000000000001
NOTICE: Decompressor pglz_decompress_hacked result 0.396454
NOTICE: Decompressor pglz_decompress_hacked4 result 0.429249
NOTICE: Decompressor pglz_decompress_hacked8 result 0.436413
NOTICE: Decompressor pglz_decompress_hacked16 result 0.478077
NOTICE: Decompressor pglz_decompress_hacked32 result 0.491488
NOTICE: Decompressor pglz_decompress_vanilla result 0.695527
NOTICE: Payload 000000010000000000000006
NOTICE: Decompressor pglz_decompress_hacked result 0.110710
NOTICE: Decompressor pglz_decompress_hacked4 result 0.115669
NOTICE: Decompressor pglz_decompress_hacked8 result 0.127637
NOTICE: Decompressor pglz_decompress_hacked16 result 0.120544
NOTICE: Decompressor pglz_decompress_hacked32 result 0.117981
NOTICE: Decompressor pglz_decompress_vanilla result 0.399446
NOTICE: Payload 000000010000000000000008
NOTICE: Decompressor pglz_decompress_hacked result 0.647402
NOTICE: Decompressor pglz_decompress_hacked4 result 0.691891
NOTICE: Decompressor pglz_decompress_hacked8 result 0.693834
NOTICE: Decompressor pglz_decompress_hacked16 result 0.776815
NOTICE: Decompressor pglz_decompress_hacked32 result 0.777960
NOTICE: Decompressor pglz_decompress_vanilla result 0.721192
NOTICE: Payload 16398
NOTICE: Decompressor pglz_decompress_hacked result 0.337654
NOTICE: Decompressor pglz_decompress_hacked4 result 0.355452
NOTICE: Decompressor pglz_decompress_hacked8 result 0.351224
NOTICE: Decompressor pglz_decompress_hacked16 result 0.362548
NOTICE: Decompressor pglz_decompress_hacked32 result 0.356456
NOTICE: Decompressor pglz_decompress_vanilla result 0.837042

Comment: identical to Server results.

Laptop results:
NOTICE: Time to decompress one byte in ns:
NOTICE: Payload 000000010000000000000001
NOTICE: Decompressor pglz_decompress_hacked result 0.661469
NOTICE: Decompressor pglz_decompress_hacked4 result 0.638366
NOTICE: Decompressor pglz_decompress_hacked8 result 0.664377
NOTICE: Decompressor pglz_decompress_hacked16 result 0.696135
NOTICE: Decompressor pglz_decompress_hacked32 result 0.634825
NOTICE: Decompressor pglz_decompress_vanilla result 0.676560
NOTICE: Payload 000000010000000000000006
NOTICE: Decompressor pglz_decompress_hacked result 0.213921
NOTICE: Decompressor pglz_decompress_hacked4 result 0.224864
NOTICE: Decompressor pglz_decompress_hacked8 result 0.229394
NOTICE: Decompressor pglz_decompress_hacked16 result 0.218141
NOTICE: Decompressor pglz_decompress_hacked32 result 0.220954
NOTICE: Decompressor pglz_decompress_vanilla result 0.242412
NOTICE: Payload 000000010000000000000008
NOTICE: Decompressor pglz_decompress_hacked result 1.053417
NOTICE: Decompressor pglz_decompress_hacked4 result 1.063704
NOTICE: Decompressor pglz_decompress_hacked8 result 1.007211
NOTICE: Decompressor pglz_decompress_hacked16 result 1.145089
NOTICE: Decompressor pglz_decompress_hacked32 result 1.079702
NOTICE: Decompressor pglz_decompress_vanilla result 1.051557
NOTICE: Payload 16398
NOTICE: Decompressor pglz_decompress_hacked result 0.251690
NOTICE: Decompressor pglz_decompress_hacked4 result 0.268125
NOTICE: Decompressor pglz_decompress_hacked8 result 0.269248
NOTICE: Decompressor pglz_decompress_hacked16 result 0.277880
NOTICE: Decompressor pglz_decompress_hacked32 result 0.270290
NOTICE: Decompressor pglz_decompress_vanilla result 0.705652

Comment: decompress time on WAL segments is statistically indistinguishable between hacked and original versions. Hacked decompression of data file is 2x faster.

We are going to try these tests on Cascade Lake processors too.

Best regards, Andrey Borodin.

#4 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#3)
Re: pglz performance

On 15 May 2019, at 15:06, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Owners of AMD and ARM devices are welcome.

Yandex hardware R&D guys gave me an ARM server and a Power9 server. They are looking for AMD and some new Intel boxes.

Meanwhile I made some enhancements to the test suite:
1. I've added a Shakespeare payload: a concatenation of the works of this prominent poet.
2. For each payload we compute a "sliced time" - the time to decompress the payload if it were sliced into 2Kb or 8Kb pieces.
3. For each decompressor we compute a "score": the sum of the times to decompress each payload, plus each payload sliced into 2Kb and 8Kb pieces, over 5 runs.

I've attached full test logs, meanwhile here's results for different platforms.

Intel Server
NOTICE: 00000: Decompressor pglz_decompress_hacked result 10.346763
NOTICE: 00000: Decompressor pglz_decompress_hacked8 result 11.192078
NOTICE: 00000: Decompressor pglz_decompress_hacked16 result 11.957727
NOTICE: 00000: Decompressor pglz_decompress_vanilla result 14.262256

ARM Server
NOTICE: Decompressor pglz_decompress_hacked result 12.966668
NOTICE: Decompressor pglz_decompress_hacked8 result 13.004935
NOTICE: Decompressor pglz_decompress_hacked16 result 13.043015
NOTICE: Decompressor pglz_decompress_vanilla result 18.239242

Power9 Server
NOTICE: Decompressor pglz_decompress_hacked result 10.992974
NOTICE: Decompressor pglz_decompress_hacked8 result 11.747443
NOTICE: Decompressor pglz_decompress_hacked16 result 11.026342
NOTICE: Decompressor pglz_decompress_vanilla result 16.375315

Intel laptop
NOTICE: Decompressor pglz_decompress_hacked result 9.445808
NOTICE: Decompressor pglz_decompress_hacked8 result 9.105360
NOTICE: Decompressor pglz_decompress_hacked16 result 9.621833
NOTICE: Decompressor pglz_decompress_vanilla result 10.661968

From these results pglz_decompress_hacked looks best.

Best regards, Andrey Borodin.

Attachments:

pglz_benchmarks.txt (text/plain)
#5 Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#4)
Re: pglz performance

On Thu, May 16, 2019 at 10:13:22PM +0500, Andrey Borodin wrote:

Meanwhile I made some enhancements to test suit:
Intel Server
NOTICE: 00000: Decompressor pglz_decompress_hacked result 10.346763
NOTICE: 00000: Decompressor pglz_decompress_hacked8 result 11.192078
NOTICE: 00000: Decompressor pglz_decompress_hacked16 result 11.957727
NOTICE: 00000: Decompressor pglz_decompress_vanilla result 14.262256

ARM Server
NOTICE: Decompressor pglz_decompress_hacked result 12.966668
NOTICE: Decompressor pglz_decompress_hacked8 result 13.004935
NOTICE: Decompressor pglz_decompress_hacked16 result 13.043015
NOTICE: Decompressor pglz_decompress_vanilla result 18.239242

Power9 Server
NOTICE: Decompressor pglz_decompress_hacked result 10.992974
NOTICE: Decompressor pglz_decompress_hacked8 result 11.747443
NOTICE: Decompressor pglz_decompress_hacked16 result 11.026342
NOTICE: Decompressor pglz_decompress_vanilla result 16.375315

Intel laptop
NOTICE: Decompressor pglz_decompress_hacked result 9.445808
NOTICE: Decompressor pglz_decompress_hacked8 result 9.105360
NOTICE: Decompressor pglz_decompress_hacked16 result 9.621833
NOTICE: Decompressor pglz_decompress_vanilla result 10.661968

From these results pglz_decompress_hacked looks best.

That's nice.

From the numbers you are presenting here, all of them are much better
than the original, and there is not much difference between any of the
patched versions. Having a 20%~30% improvement with a patch is very
nice.

After that comes the simplicity and the future maintainability of what
is proposed. I am not much into accepting a patch which has a 1%~2%
impact on some hardware and makes pglz much more complex and harder
to understand. But I am really eager to see a patch with at least a
10% improvement which remains simple, even more if it simplifies the
logic used in pglz.
--
Michael

#6 Andrey Borodin
amborodin@acm.org
In reply to: Michael Paquier (#5)
Re: pglz performance

On 17 May 2019, at 6:44, Michael Paquier <michael@paquier.xyz> wrote:

That's nice.

From the numbers you are presenting here, all of them are much better
than the original, and there is not much difference between any of the
patched versions. Having a 20%~30% improvement with a patch is very
nice.

After that comes the simplicity and the future maintainability of what
is proposed. I am not much into accepting a patch which has a 1%~2%
impact on some hardware and makes pglz much more complex and harder
to understand. But I am really eager to see a patch with at least a
10% improvement which remains simple, even more if it simplifies the
logic used in pglz.

Here are patches for both winning versions. I'll place them on CF.
My gut feeling is pglz_decompress_hacked8 should be better, but on most architectures the benchmarks show the opposite.

Best regards, Andrey Borodin.

Attachments:

pglz_decompress_hacked8.diff (application/octet-stream, +21 −6)
pglz_decompress_hacked.diff (application/octet-stream, +9 −6)
#7 Gasper Zejn
zejn@owca.info
In reply to: Andrey Borodin (#4)
Re: pglz performance

On 16. 05. 19 19:13, Andrey Borodin wrote:

On 15 May 2019, at 15:06, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Owners of AMD and ARM devices are welcome.

I've tested according to instructions at the test repo
https://github.com/x4m/test_pglz

Test_pglz is at a97f63b and postgres at 6ba500.

Hardware is desktop AMD Ryzen 5 2600, 32GB RAM

Decompressor score (summ of all times):

NOTICE:  Decompressor pglz_decompress_hacked result 6.988909
NOTICE:  Decompressor pglz_decompress_hacked8 result 7.562619
NOTICE:  Decompressor pglz_decompress_hacked16 result 8.316957
NOTICE:  Decompressor pglz_decompress_vanilla result 10.725826

Attached is the full test run, if needed.

Kind regards,

Gasper


Attachments:

pglz_benchmarks_amd.txt (text/plain; charset=UTF-8)
#8 Andrey Borodin
amborodin@acm.org
In reply to: Gasper Zejn (#7)
Re: pglz performance

On 17 May 2019, at 18:40, Gasper Zejn <zejn@owca.info> wrote:

I've tested according to instructions at the test repo
https://github.com/x4m/test_pglz

Test_pglz is at a97f63b and postgres at 6ba500.

Hardware is desktop AMD Ryzen 5 2600, 32GB RAM

Decompressor score (summ of all times):

NOTICE: Decompressor pglz_decompress_hacked result 6.988909
NOTICE: Decompressor pglz_decompress_hacked8 result 7.562619
NOTICE: Decompressor pglz_decompress_hacked16 result 8.316957
NOTICE: Decompressor pglz_decompress_vanilla result 10.725826

Thanks, Gasper! Basically we observe the same ~0.65x time reduction here.

It's very good that we have independent scores.

I'm still somewhat unsure that the score is fair: on payload 000000010000000000000008 hacked decompression is sometimes a few percent slower than vanilla, and this is especially visible on AMD. The degradation for 000000010000000000000008 sliced into 8Kb pieces reaches 10%.

I think this is because 000000010000000000000008 has the highest entropy. It is almost random, and matches are very short but present.
000000010000000000000008
Entropy = 4.360546 bits per byte.
000000010000000000000006
Entropy = 1.450059 bits per byte.
000000010000000000000001
Entropy = 2.944235 bits per byte.
shakespeare.txt
Entropy = 3.603659 bits per byte.
16398
Entropy = 1.897640 bits per byte.

Best regards, Andrey Borodin.

#9 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#8)
Re: pglz performance

On 18 May 2019, at 11:44, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi!
Here's a rebased version of the patches.

Best regards, Andrey Borodin.

Attachments:

0001-Use-memcpy-in-pglz-decompression-for-long-matches.patch (application/octet-stream, +21 −7)
0001-Use-memcpy-in-pglz-decompression.patch (application/octet-stream, +9 −7)
#10 Andrey Borodin
amborodin@acm.org
In reply to: Michael Paquier (#2)
Re: pglz performance

On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:

Decompression can matter a lot for mostly-read workloads and
compression can become a bottleneck for heavy-insert loads, so
improving compression or decompression should be two separate
problems, not two problems linked. Any improvement in one or the
other, or even both, is nice to have.

Here's a patch hacked up by Vladimir for compression.

Key differences (as far as I see; maybe Vladimir will post a more complete list of optimizations):
1. Use functions instead of macro-functions: not surprisingly, they are easier to optimize and impose fewer constraints on the compiler.
2. A more compact hash table: use indexes instead of pointers.
3. A more robust segment comparison: like memcmp, but returns the index of the first differing byte.

In a weighted mix of different data (same as for decompression), the overall speedup is 1.43x on my machine.

The current implementation is integrated into the test_pglz suite for benchmarking purposes[0].

Best regards, Andrey Borodin.

[0]: https://github.com/x4m/test_pglz

Attachments:

0001-Reorganize-pglz-compression-code.patch (application/octet-stream, +305 −209)
#11 Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andrey Borodin (#10)
Re: pglz performance

On 27.06.2019 21:33, Andrey Borodin wrote:

On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:

Decompression can matter a lot for mostly-read workloads and
compression can become a bottleneck for heavy-insert loads, so
improving compression or decompression should be two separate
problems, not two problems linked. Any improvement in one or the
other, or even both, is nice to have.

Here's patch hacked by Vladimir for compression.

Key differences (as far as I see; maybe Vladimir will post a more complete list of optimizations):
1. Use functions instead of macro-functions: not surprisingly, they are easier to optimize and impose fewer constraints on the compiler.
2. A more compact hash table: use indexes instead of pointers.
3. A more robust segment comparison: like memcmp, but returns the index of the first differing byte.

In a weighted mix of different data (same as for decompression), the overall speedup is 1.43x on my machine.

Current implementation is integrated into test_pglz suit for benchmarking purposes[0].

Best regards, Andrey Borodin.

[0] https://github.com/x4m/test_pglz

It took me some time to understand that your memcpy optimization is
correct ;)
I have tested different ways of optimizing this fragment of code, but
failed to outperform your implementation!
Results on my computer are similar to yours:

Decompressor score (summ of all times):
NOTICE:  Decompressor pglz_decompress_hacked result 6.627355
NOTICE:  Decompressor pglz_decompress_hacked_unrolled result 7.497114
NOTICE:  Decompressor pglz_decompress_hacked8 result 7.412944
NOTICE:  Decompressor pglz_decompress_hacked16 result 7.792978
NOTICE:  Decompressor pglz_decompress_vanilla result 10.652603

Compressor score (summ of all times):
NOTICE:  Compressor pglz_compress_vanilla result 116.970005
NOTICE:  Compressor pglz_compress_hacked result 89.706105

But ...  below are results for lz4:

Decompressor score (summ of all times):
NOTICE:  Decompressor lz4_decompress result 3.660066
Compressor score (summ of all times):
NOTICE:  Compressor lz4_compress result 10.288594

There is a 2x advantage in decompression speed and a 10x advantage in
compression speed.
So maybe instead of "hacking" the pglz algorithm we should rather
switch to lz4?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Konstantin Knizhnik (#11)
Re: pglz performance

On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:

On 27.06.2019 21:33, Andrey Borodin wrote:

On 13 May 2019, at 12:14, Michael Paquier <michael@paquier.xyz> wrote:

Decompression can matter a lot for mostly-read workloads and
compression can become a bottleneck for heavy-insert loads, so
improving compression or decompression should be two separate
problems, not two problems linked. Any improvement in one or the
other, or even both, is nice to have.

Here's patch hacked by Vladimir for compression.

Key differences (as far as I see, maybe Vladimir will post more complete list of optimizations):
1. Use functions instead of macro-functions: not surprisingly it's easier to optimize them and provide less constraints for compiler to optimize.
2. More compact hash table: use indexes instead of pointers.
3. More robust segment comparison: like memcmp, but return index of first different byte

In weighted mix of different data (same as for compression), overall speedup is x1.43 on my machine.

Current implementation is integrated into test_pglz suit for benchmarking purposes[0].

Best regards, Andrey Borodin.

[0] https://github.com/x4m/test_pglz

It took me some time to understand that your memcpy optimization is
correct ;)
I have tested different ways of optimizing this fragment of code, but
failed to outperform your implementation!
Results on my computer are similar to yours:

Decompressor score (summ of all times):
NOTICE:  Decompressor pglz_decompress_hacked result 6.627355
NOTICE:  Decompressor pglz_decompress_hacked_unrolled result 7.497114
NOTICE:  Decompressor pglz_decompress_hacked8 result 7.412944
NOTICE:  Decompressor pglz_decompress_hacked16 result 7.792978
NOTICE:  Decompressor pglz_decompress_vanilla result 10.652603

Compressor score (summ of all times):
NOTICE:  Compressor pglz_compress_vanilla result 116.970005
NOTICE:  Compressor pglz_compress_hacked result 89.706105

But ...  below are results for lz4:

Decompressor score (summ of all times):
NOTICE:  Decompressor lz4_decompress result 3.660066
Compressor score (summ of all times):
NOTICE:  Compressor lz4_compress result 10.288594

There is a 2x advantage in decompression speed and a 10x advantage
in compression speed.
So maybe instead of "hacking" the pglz algorithm we should rather
switch to lz4?

I think we should just bite the bullet and add initdb option to pick
compression algorithm. That's been discussed repeatedly, but we never
ended up actually doing that. See for example [1].

If there's anyone willing to put some effort into getting this feature
over the line, I'm willing to do reviews & commit. It's a seemingly
small change with rather insane potential impact.

But even if we end up doing that, it still makes sense to optimize the
hell out of pglz, because existing systems will still use that
(pg_upgrade can't switch from one compression algorithm to another).

regards

[1]: /messages/by-id/55341569.1090107@2ndquadrant.com

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13 Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#12)
Re: pglz performance

Thanks for looking into this!

On 2 Aug 2019, at 19:43, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:

It takes me some time to understand that your memcpy optimization is correct;)

Seems that comments are not explanatory enough... will try to fix.

I have tested different ways of optimizing this fragment of code, but failed to outperform your implementation!

JFYI, we tried optimizations such as memcpy with a constant size (optimized into inline instructions instead of a call), unrolling the literal loop, and some others. None of these worked better.

But ... below are results for lz4:

Decompressor score (summ of all times):
NOTICE: Decompressor lz4_decompress result 3.660066
Compressor score (summ of all times):
NOTICE: Compressor lz4_compress result 10.288594

There is 2 times advantage in decompress speed and 10 times advantage in compress speed.
So may be instead of "hacking" pglz algorithm we should better switch to lz4?

I think we should just bite the bullet and add initdb option to pick
compression algorithm. That's been discussed repeatedly, but we never
ended up actually doing that. See for example [1].

If there's anyone willing to put some effort into getting this feature
over the line, I'm willing to do reviews & commit. It's a seemingly
small change with rather insane potential impact.

But even if we end up doing that, it still makes sense to optimize the
hell out of pglz, because existing systems will still use that
(pg_upgrade can't switch from one compression algorithm to another).

We have some kind of "roadmap" for "extensible pglz". We plan to provide an implementation at the November CF.

Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will improve the compression ratio.
It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data.
Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use.
But a system command can also say "invoke decompression from an extension".

Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly.

This would make a hard-chosen compression algorithm unnecessary, though it may seem overly hacky. But there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot.

Best regards, Andrey Borodin.

#14Andres Freund
andres@anarazel.de
In reply to: Andrey Borodin (#13)
Re: pglz performance

Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:

We have a kind of "roadmap" for "extensible pglz". We plan to provide an implementation at the November CF.

I don't understand why it's a good idea to improve the compression side
of pglz. There are plenty of other people who have spent a lot of time
developing better compression algorithms.

Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start of the data. We can prepend an imaginary prefix containing common substrings to any data: this will improve the compression ratio.
It is hard to decide on a training data set for this "common prefix", so we want to provide an extension with an aggregate function that produces an "adapted common prefix" from the user's data.
Then we can "reserve" a few negative bytes for "decompression commands". Such a command can tell the database which common prefix to use.
A system command could also say "invoke decompression from extension".

Thus, users will be able to train database compression on their own data and seamlessly substitute a custom compression method for pglz.

This would make a hard-wired compression choice unnecessary, but it seems overly hacky. On the other hand, there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for their foot.

I think this is way too complicated, and will not provide much benefit
for the majority of users.

In fact, I'll argue that we should flat out reject any such patch until
we have at least one decent default compression algorithm in
core. You're trying to work around a poor compression algorithm with a
complicated dictionary improvement that requires user interaction, will
only work in a relatively small subset of cases, and will very often
increase compression times.

Greetings,

Andres Freund

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#14)
Re: pglz performance

On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:

Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:

We have a kind of "roadmap" for "extensible pglz". We plan to
provide an implementation at the November CF.

I don't understand why it's a good idea to improve the compression side
of pglz. There are plenty of other people who have spent a lot of
time developing better compression algorithms.

Isn't it beneficial for existing systems, that will be stuck with pglz
even if we end up adding other algorithms?

Currently, pglz starts with an empty cache map: there are no prior 4k
bytes before the start of the data. We can prepend an imaginary prefix
containing common substrings to any data: this will improve the
compression ratio. It is hard to decide on a training data set for
this "common prefix", so we want to provide an extension with an
aggregate function that produces an "adapted common prefix" from the
user's data. Then we can "reserve" a few negative bytes for
"decompression commands". Such a command can tell the database which
common prefix to use. A system command could also say "invoke
decompression from extension".

Thus, users will be able to train database compression on their own
data and seamlessly substitute a custom compression method for pglz.

This would make a hard-wired compression choice unnecessary, but it
seems overly hacky. On the other hand, there would be no need to have
lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time
series compression"? Or "DNA compression"? Whatever gun the user wants
for their foot.

I think this is way too complicated, and will not provide much benefit
for the majority of users.

I agree with this. I do see value in the feature, but probably not as a
drop-in replacement for the default compression algorithm. I'd compare
it to the "custom compression methods" patch that was submitted some
time ago.

In fact, I'll argue that we should flat out reject any such patch until
we have at least one decent default compression algorithm in core.
You're trying to work around a poor compression algorithm with a
complicated dictionary improvement that requires user interaction, will
only work in a relatively small subset of cases, and will very often
increase compression times.

I wouldn't be so strict, I guess. But I do agree that an algorithm that
requires additional steps (training, ...) is unlikely to be a good
candidate for the default instance compression algorithm.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#15)
Re: pglz performance

Hi,

On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote:

On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:

Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:

We have a kind of "roadmap" for "extensible pglz". We plan to
provide an implementation at the November CF.

I don't understand why it's a good idea to improve the compression side
of pglz. There are plenty of other people who have spent a lot of
time developing better compression algorithms.

Isn't it beneficial for existing systems, that will be stuck with pglz
even if we end up adding other algorithms?

Why would they be stuck continuing to *compress* with pglz? As we
fully retoast on write anyway we can just gradually switch over to the
better algorithm. Decompression speed is another story, of course.

Here's what I had a few years back:

/messages/by-id/20130621000900.GA12425@alap2.anarazel.de
see also
/messages/by-id/20130605150144.GD28067@alap2.anarazel.de

I think we should refresh something like that patch, and:
- make the compression algorithm GUC an enum, rename
- add --with-system-lz4
- obviously refresh the copy of lz4
- drop snappy

Greetings,

Andres Freund

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#16)
Re: pglz performance

On Fri, Aug 02, 2019 at 10:12:58AM -0700, Andres Freund wrote:

Hi,

On 2019-08-02 19:00:39 +0200, Tomas Vondra wrote:

On Fri, Aug 02, 2019 at 09:39:48AM -0700, Andres Freund wrote:

Hi,

On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:

We have a kind of "roadmap" for "extensible pglz". We plan to
provide an implementation at the November CF.

I don't understand why it's a good idea to improve the compression side
of pglz. There are plenty of other people who have spent a lot of
time developing better compression algorithms.

Isn't it beneficial for existing systems, that will be stuck with pglz
even if we end up adding other algorithms?

Why would they be stuck continuing to *compress* with pglz? As we
fully retoast on write anyway we can just gradually switch over to the
better algorithm. Decompression speed is another story, of course.

Hmmm, I don't remember the details of those patches so I didn't realize
it allows incremental recompression. If that's possible, that would mean
existing systems can start using it. Which is good.

Another question is whether we'd actually want to include the code in
core directly, or use system libraries (and if some packagers might
decide to disable that, for whatever reason).

But yeah, I agree you may have a point about optimizing pglz compression.

Here's what I had a few years back:

/messages/by-id/20130621000900.GA12425@alap2.anarazel.de
see also
/messages/by-id/20130605150144.GD28067@alap2.anarazel.de

I think we should refresh something like that patch, and:
- make the compression algorithm GUC an enum, rename
- add --with-system-lz4
- obviously refresh the copy of lz4
- drop snappy

That's a reasonable plan, I guess.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#17)
Re: pglz performance

Hi,

On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote:

Hmmm, I don't remember the details of those patches so I didn't realize
it allows incremental recompression. If that's possible, that would mean
existing systems can start using it. Which is good.

That depends on what you mean by "incremental". A single toasted
datum can only have one compression type, because we always rewrite it
as a whole anyway. But different datums can be compressed differently.
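Per-datum method selection can be sketched as a one-byte method tag in front of each compressed value, so old pglz datums keep decompressing while newly written ones use a better codec. The tags and codec table below are invented for illustration, and zlib stands in for both codecs:

```python
# Hedged sketch of per-datum compression methods: each stored value
# carries a one-byte method tag, and the decompressor dispatches on it.
import zlib

PGLZ, LZ4 = 0x00, 0x01  # hypothetical method tags

CODECS = {
    # Stand-ins: both map to zlib here; a real system would dispatch
    # to actual pglz/lz4 implementations.
    PGLZ: (zlib.compress, zlib.decompress),
    LZ4: (zlib.compress, zlib.decompress),
}

def toast_compress(data: bytes, method: int) -> bytes:
    comp, _ = CODECS[method]
    return bytes([method]) + comp(data)

def toast_decompress(blob: bytes) -> bytes:
    _, decomp = CODECS[blob[0]]
    return decomp(blob[1:])

# Old and new datums coexist; each decompresses via its own tag.
old = toast_compress(b"written before the upgrade", PGLZ)
new = toast_compress(b"written after the upgrade", LZ4)
assert toast_decompress(old) == b"written before the upgrade"
assert toast_decompress(new) == b"written after the upgrade"
```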

Another question is whether we'd actually want to include the code in
core directly, or use system libraries (and if some packagers might
decide to disable that, for whatever reason).

I'd personally say we should have an included version, and a
--with-system-... flag that uses the system one.

Greetings,

Andres Freund

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#18)
Re: pglz performance

On Fri, Aug 02, 2019 at 11:20:03AM -0700, Andres Freund wrote:

Hi,

On 2019-08-02 19:52:39 +0200, Tomas Vondra wrote:

Hmmm, I don't remember the details of those patches so I didn't realize
it allows incremental recompression. If that's possible, that would mean
existing systems can start using it. Which is good.

That depends on what you mean by "incremental". A single toasted
datum can only have one compression type, because we always rewrite it
as a whole anyway. But different datums can be compressed differently.

I meant different toast values using different compression algorithm,
sorry for the confusion.

Another question is whether we'd actually want to include the code in
core directly, or use system libraries (and if some packagers might
decide to disable that, for whatever reason).

I'd personally say we should have an included version, and a
--with-system-... flag that uses the system one.

OK. I'd say to require a system library, but that's a minor detail.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andres Freund (#18)
Re: pglz performance

On 02.08.2019 21:20, Andres Freund wrote:

Another question is whether we'd actually want to include the code in
core directly, or use system libraries (and if some packagers might
decide to disable that, for whatever reason).

I'd personally say we should have an included version, and a
--with-system-... flag that uses the system one.

+1

#21Petr Jelinek
petr@2ndquadrant.com
In reply to: Tomas Vondra (#19)
#22Andrey Borodin
amborodin@acm.org
In reply to: Petr Jelinek (#21)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Petr Jelinek (#21)
#24Petr Jelinek
petr@2ndquadrant.com
In reply to: Andrey Borodin (#22)
#25Petr Jelinek
petr@2ndquadrant.com
In reply to: Tomas Vondra (#23)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Petr Jelinek (#25)
#27Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#21)
#28Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#25)
#30Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#17)
#32Andres Freund
andres@anarazel.de
In reply to: Petr Jelinek (#30)
#33Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#31)
#34Peter Eisentraut
peter_e@gmx.net
In reply to: Andrey Borodin (#9)
#35Andrey Borodin
amborodin@acm.org
In reply to: Peter Eisentraut (#34)
#36Peter Eisentraut
peter_e@gmx.net
In reply to: Andrey Borodin (#35)
#37Andrey Borodin
amborodin@acm.org
In reply to: Peter Eisentraut (#36)
#38Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#1)
#39Oleg Bartunov
oleg@sai.msu.su
In reply to: Andrey Borodin (#35)
#40Petr Jelinek
petr@2ndquadrant.com
In reply to: Andres Freund (#32)
#41Peter Eisentraut
peter_e@gmx.net
In reply to: Andrey Borodin (#37)
#42Andrey Borodin
amborodin@acm.org
In reply to: Peter Eisentraut (#41)
#43Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#42)
#44Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#43)
#45Peter Eisentraut
peter_e@gmx.net
In reply to: Andrey Borodin (#44)
#46Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Eisentraut (#45)
#47Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#46)
#48Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#47)
#49Tels
nospam-pg-abuse@bloodgate.com
In reply to: Andrey Borodin (#48)
#50Andrey Borodin
amborodin@acm.org
In reply to: Tels (#49)
#51Peter Eisentraut
peter_e@gmx.net
In reply to: Alvaro Herrera (#46)
#52Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#47)
#53Michael Paquier
michael@paquier.xyz
In reply to: Peter Eisentraut (#52)
#54Andrey Borodin
amborodin@acm.org
In reply to: Michael Paquier (#53)
#55Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#55)
#57Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#56)
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Eisentraut (#57)
#59Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#58)
#60Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#59)
#61Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#56)
#62Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrey Borodin (#61)
#63Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#62)
#64Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrey Borodin (#63)
#65Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#62)
#66Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andrey Borodin (#63)
#67Andrey Borodin
amborodin@acm.org
In reply to: Tomas Vondra (#66)
#68Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#67)
#69Andrey Borodin
amborodin@acm.org
In reply to: Petr Jelinek (#21)