Compression of bigger WAL records

Started by Andrey Borodin, about 1 year ago · 14 messages
#1 Andrey Borodin
amborodin@acm.org

Hi hackers!

I propose a slight change to WAL compression: compress the body of big records if it is bigger than some threshold.

===Rationale===
0. Better compression ratio for full page images when pages are compressed together.

Consider the following test:

set wal_compression to 'zstd';
create table a as select random() from generate_series(1,1e7);
create index on a(random ); -- warmup to avoid FPI for hint on the heap
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;

The B-tree index build emits 97 MB of WAL instead of the 125 MB produced when FPIs are compressed independently.

1. Compression of big records that are not FPIs. E.g., two-phase commit records might be big enough to cross the threshold.

2. This might be a path to full WAL compression. In the future I plan to propose a compression context: retaining the compression dictionary between records. Obviously, the context cannot cross checkpoint boundaries. And a pool of contexts would be needed to fully exploit the efficiency of compression codecs. Anyway, it's too early to theorize.

===Prototype===
I attach a prototype patch. It is functional, but some check-world tests fail, probably because they expect to generate more WAL without adding too much entropy. Or perhaps I missed some bugs. In the present version WAL_DEBUG does not indicate any problems, but a lot of quality-assurance and commenting work is still needed. It's a prototype.

To indicate that WAL record is compressed I use a bit in record->xl_info (XLR_COMPRESSED == 0x04). I found no places that use this bit...
If the record is compressed, record header is continued with information about compression: codec byte and uint32 of uncompressed xl_tot_len.

Currently, compression is done on StringInfo buffers that are expanded before the actual WALInsert() happens. If a palloc() would be needed during the critical section, compression is canceled. I do not like doing memory accounting before WALInsert; probably something cleverer can be done about it.

WAL_DEBUG and wal_compression are enabled for debugging purposes. Of course, I do not propose to turn them on by default.

What do you think? Does this approach seem viable?

Best regards, Andrey Borodin.

Attachments:

v0-0001-Compress-big-WAL-records.patch (application/octet-stream, +326 −302)
#2 Kirill Reshke
reshkekirill@gmail.com
In reply to: Andrey Borodin (#1)
Re: Compression of bigger WAL records


On Sun, 12 Jan 2025 at 17:43, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

Hi hackers!

I propose a slight change to WAL compression: compress body of big records, if it's bigger than some threshold.

Hi,
initdb fails when configured with --without-zstd

```
reshke@ygp-jammy:~/postgres$ ./pgbin/bin/initdb -D db
The files belonging to this database system will be owned by user "reshke".
This user must also own the server process.

The database cluster will be initialized with locale "C.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

creating directory db ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "autovacuum_worker_slots" ... 16
selecting default "shared_buffers" ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... 2025-01-12 18:10:47.657 UTC [4167965] FATAL: zstd is not supported by this build
2025-01-12 18:10:47.657 UTC [4167965] PANIC: cannot abort transaction 1, it was already committed
Aborted (core dumped)
child process exited with exit code 134
initdb: removing data directory "db"
```

Also pg_waldump fails with

```
corrupted size vs. prev_size
Aborted (core dumped)
```

Best regards,
Kirill Reshke

#3 Andrey Borodin
amborodin@acm.org
In reply to: Kirill Reshke (#2)
Re: Compression of bigger WAL records

Hi! Thanks for looking into this!

On 12 Jan 2025, at 23:36, Kirill Reshke <reshkekirill@gmail.com> wrote:

initdb fails when configured with --without-zstd

Yes, the patch is intended to demonstrate improvement when using Zstd.

On 12 Jan 2025, at 17:43, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

WAL_DEBUG and wal_compression are enabled for debugging purposes. Of course, I do not propose to turn them on by default.

And this does not work well --without-zstd.

Also pg_waldump fails with

```
corrupted size vs. prev_size
Aborted (core dumped)
```

I’ll fix that, thanks!
Also seems like I forgot to bump WAL_FILE_MAGIC…

What do you think about proposed approach?

Best regards, Andrey Borodin.

#4 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#1)
Re: Compression of bigger WAL records

On 12 Jan 2025, at 17:43, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

I attach a prototype patch.

Here's v2, now it passes all the tests with wal_debug.

Some stats. On this test

create table a as select random() from generate_series(1,1e7);
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;
set wal_compression to 'lz4';
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;
set wal_compression to 'pglz';
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;
set wal_compression to 'zstd';
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;

I observe WAL size of the index:
method HEAD patched
pglz 193 MB 193 MB
lz4 160 MB 132 MB
zstd 125 MB 97 MB

So, for lz4 and zstd this seems to be a significant reduction.

I'm planning to work on improving the patch quality.

Thanks!

Best regards, Andrey Borodin.

Attachments:

v2-0001-Compress-big-WAL-records.patch (application/octet-stream, +401 −368)
#5 Japin Li
japinli@hotmail.com
In reply to: Andrey Borodin (#4)
Re: Compression of bigger WAL records

On Tue, 21 Jan 2025 at 23:24, "Andrey M. Borodin" <x4mmm@yandex-team.ru> wrote:

On 12 Jan 2025, at 17:43, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

I attach a prototype patch.

Here's v2, now it passes all the tests with wal_debug.

Some stats. On this test

create table a as select random() from generate_series(1,1e7);
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;
set wal_compression to 'lz4';
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;
set wal_compression to 'pglz';
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;
set wal_compression to 'zstd';
select pg_stat_reset_shared('wal'); create index on a(random ); select pg_size_pretty(wal_bytes) from pg_stat_wal;

I observe WAL size of the index:
method HEAD patched
pglz 193 MB 193 MB
lz4 160 MB 132 MB
zstd 125 MB 97 MB

So, for lz4 and zstd this seems to be a significant reduction.

I'm planning to work on improving the patch quality.

Thanks!

Hi, Andrey Borodin

I find this feature interesting; however, it cannot be applied to the current
master (b35434b134b) due to commit 32a18cc0a73.

Applying: Compress big WAL records
.git/rebase-apply/patch:83: trailing whitespace.

.git/rebase-apply/patch:90: trailing whitespace.

.git/rebase-apply/patch:315: trailing whitespace.

.git/rebase-apply/patch:780: trailing whitespace.
else
error: contrib/pg_walinspect/pg_walinspect.c: does not match index
error: src/backend/access/rmgrdesc/xlogdesc.c: does not match index
error: src/backend/access/transam/xlog.c: does not match index
error: src/backend/access/transam/xloginsert.c: does not match index
error: src/backend/access/transam/xlogreader.c: does not match index
error: src/backend/utils/misc/guc_tables.c: does not match index
error: src/backend/utils/misc/postgresql.conf.sample: does not match index
error: src/include/access/xlog.h: does not match index
error: src/include/access/xloginsert.h: does not match index
error: src/include/access/xlogreader.h: does not match index
error: src/include/access/xlogrecord.h: does not match index
error: src/include/pg_config_manual.h: does not match index
error: src/test/recovery/t/026_overwrite_contrecord.pl: does not match index
error: patch failed: src/test/recovery/t/039_end_of_wal.pl:81
error: src/test/recovery/t/039_end_of_wal.pl: patch does not apply
Patch failed at 0001 Compress big WAL records
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

I see the patch compresses whole WAL records according to wal_compression.
IIRC, wal_compression was previously only used for FPIs, right? Maybe we
should update the description of this parameter.

I see that the wal_compression_threshold defaults to 512. I wonder if you
chose this value based on testing or randomly.

--
Regrads,
Japin Li

#6 Fujii Masao
masao.fujii@gmail.com
In reply to: Andrey Borodin (#4)
Re: Compression of bigger WAL records

On 2025/01/22 3:24, Andrey M. Borodin wrote:

On 12 Jan 2025, at 17:43, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

I attach a prototype patch.

Here's v2, now it passes all the tests with wal_debug.

I like the idea of WAL compression more.

With the current approach, each backend needs to allocate memory twice
the size of the total WAL record. Right? One area is for the gathered
WAL record data (from rdt and registered_buffers), and the other is for
storing the compressed data. Could this lead to potential memory usage
concerns? Perhaps we should consider setting a limit on the maximum
memory each backend can use for WAL compression?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#7 Andrey Borodin
amborodin@acm.org
In reply to: Fujii Masao (#6)
Re: Compression of bigger WAL records

On 23 Jan 2025, at 20:13, Japin Li <japinli@hotmail.com> wrote:

I find this feature interesting;

Thank you for your interest in the patch!

however, it cannot be applied to the current
master (b35434b134b) due to commit 32a18cc0a73.

PFA a rebased version.

I see the patch compresses whole WAL records according to wal_compression.
IIRC, wal_compression was previously only used for FPIs, right? Maybe we
should update the description of this parameter.

Yes, I'll update the documentation in future versions too.

I see that the wal_compression_threshold defaults to 512. I wonder if you
chose this value based on testing or randomly.

Voices in my head told me it's a good number.

On 28 Jan 2025, at 22:10, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

I like the idea of WAL compression more.

Thank you!

With the current approach, each backend needs to allocate memory twice
the size of the total WAL record. Right? One area is for the gathered
WAL record data (from rdt and registered_buffers), and the other is for
storing the compressed data.

Yes, exactly. And also a decompression buffer for each WAL reader.

Could this lead to potential memory usage
concerns? Perhaps we should consider setting a limit on the maximum
memory each backend can use for WAL compression?

Yes, the limit makes sense.

Also, we can reduce memory consumption by employing streaming compression. I'm currently working on a prototype of that, because it would allow wholesale WAL compression. The idea is to reuse the compression context from previous records to better compress new ones. This would allow efficient compression of even very small records. However, there is exactly zero chance of getting it into decent shape before feature freeze.

The chances of getting the currently proposed approach into v18 seem slim too... I'm hesitant to register this patch in the CF. What do you think?

Best regards, Andrey Borodin.

Attachments:

v3-0001-Compress-big-WAL-records.patch (application/octet-stream, +401 −367)
#8 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Andrey Borodin (#7)
Re: Compression of bigger WAL records

Hi Andrey,
I have a question: if wal_compression_threshold is set higher than the WAL
block size, then FPIs are not compressed. If so, might it make sense to cap
this parameter at the WAL block size?

Best regards

On Thu, Jan 30, 2025 at 9:26 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:


#9 Andrey Borodin
amborodin@acm.org
In reply to: wenhui qiu (#8)
Re: Compression of bigger WAL records

On 31 Jan 2025, at 08:37, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

Hi Andrey,
I have a question: if wal_compression_threshold is set higher than the WAL block size, then FPIs are not compressed. If so, might it make sense to cap this parameter at the WAL block size?

Oops, looks like I missed your question. Sorry for the long delay.

A user might want to compress only megabyte-plus records; there's nothing wrong with that. A WAL record itself is capped at 1 GB (XLogRecordMaxSize), so I see no reason to restrict wal_compression_threshold to a lower value.

PFA rebased version.

Best regards, Andrey Borodin.

Attachments:

v4-0001-Compress-big-WAL-records.patch (application/octet-stream, +400 −367)
#10 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#9)
Re: Compression of bigger WAL records

On 14 Jul 2025, at 23:22, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

PFA rebased version.

Here's a rebased version. I also fixed a problem where a wrong memory context could be used for allocating the compression buffer.

Best regards, Andrey Borodin.

Attachments:

v5-0001-Compress-big-WAL-records.patch (application/octet-stream, +418 −353)
#11 Fujii Masao
masao.fujii@gmail.com
In reply to: Andrey Borodin (#10)
Re: Compression of bigger WAL records

On Mon, Jan 12, 2026 at 2:54 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 14 Jul 2025, at 23:22, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

PFA rebased version.

Here's a rebased version. Also I fixed a problem of possible wrong memory context used for allocating compression buffer.

Thanks for updating the patch!

With the v5 patch, I see the following compiler warning:

xlog.c:726:1: warning: unused function 'XLogGetRecordTotalLen'
[-Wunused-function]
726 | XLogGetRecordTotalLen(XLogRecord *record)
| ^~~~~~~~~~~~~~~~~~~~~

This seems to happen because XLogGetRecordTotalLen() is only used under
WAL_DEBUG. If that's correct, its definition should probably also be guarded
by WAL_DEBUG to avoid the warning.

cfbot reported a regression test failure with v5. Could you please
look into that?
https://cirrus-ci.com/build/5635306839343104

When I ran pg_waldump on WAL generated with wal_compression=pglz and
wal_compression_threshold=32, I got this error:

pg_waldump: error: error in WAL record at 0/02183BE0: could not
decompress record at 0/2183D10

Isn't this a bug?

+ XLogEnsureCompressionBuffer(MaxSizeOfXLogRecordBlockHeader + BLCKSZ);

XLogEnsureCompressionBuffer() is now called every time XLogRegisterBuffer(),
XLogRegisterBlock(), XLogRegisterData(), and XLogRegisterBufData() are invoked.
Why is that necessary? Wouldn't it be sufficient to
call XLogEnsureCompressionBuffer() once, with the total length,
just before XLogCompressRdt(rdt)?

v5 removes the ability to compress only full-page images, which is the current
wal_compression behavior. That may be disappointing for users who rely on
the existing semantics. Would it make more sense to keep the current behavior
and add a new feature to compress entire WAL records whose size exceeds
the specified threshold?

Regards,

--
Fujii Masao

#12 Andrey Borodin
amborodin@acm.org
In reply to: Fujii Masao (#11)
Re: Compression of bigger WAL records

Hi Fujii!

Thanks for the review, I'll address your feedback soon.

On 16 Jan 2026, at 20:44, Fujii Masao <masao.fujii@gmail.com> wrote:

Would it make more sense to keep the current behavior
and add a new feature to compress entire WAL records whose size exceeds
the specified threshold?

That's a very good idea! We don't need to replace current behavior, we can just complement it.
I'll implement this idea!

Best regards, Andrey Borodin.

#13 Andrey Borodin
amborodin@acm.org
In reply to: Andrey Borodin (#12)
Re: Compression of bigger WAL records

On 16 Jan 2026, at 21:17, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

That's a very good idea! We don't need to replace current behavior, we can just complement it.
I'll implement this idea!

Here's the implementation. The previously existing buffers are now combined
into a single allocation, whose size is GUC-controlled (you can add more memory).

However, this buffer is now just big enough to accommodate most records...
So maybe we do not need a GUC at all, because keeping it minimal (the same
consumption as before the patch) is already enough.

The patch now has essentially no extra memory footprint, but saves about 25%
of WAL on index creation (in the case of random data).

Users can force FPI-only compression by increasing wal_compression_threshold
to 1 GB.

The decision chain is now a bit complicated:
- assemble the record without compressing FPIs
- try whole-record compression
- if compression enlarged the record, fall back to FPI compression
I think this can be simplified to "try only the one compression approach that
is expected to work; if it doesn't help, insert uncompressed".

What do you think?

Best regards, Andrey Borodin.

Attachments:

v6-0001-Add-whole-record-WAL-compression-alongside-FPI-co.patch (application/octet-stream, +846 −103)
#14 Zsolt Parragi
zsolt.parragi@percona.com
In reply to: Andrey Borodin (#13)
Re: Compression of bigger WAL records

Hello

+static void
+AllocCompressionBuffers(void)
+{
+ uint32 new_size = wal_compression_buffer;

This is called in the assign hook - and isn't that called before the
global variable is updated?

+ compressed_header->method = XLR_COMPRESS_LZ4;
+ compr_len = LZ4_compress_default((char *) &src_header[1], (char *) &compressed_header[1],
+    orig_len, compressed_data_size);

compressed_header[1] has an offset of 32, but compressed_data_size
refers to the entire size, isn't there a possible buffer overrun here?
Same with ZSTD.

+/* Header prepended to a whole-record compressed WAL record */
+typedef struct XLogCompressionHeader
+{
+ XLogRecord record_header;
+ uint8 method; /* XLR_COMPRESS_* */
+ uint32 decompressed_length;
+} XLogCompressionHeader;

This has 3 bytes of uninitialized padding, is that okay? I remember
seeing a separate thread about possibly cleaning these up not long
ago.

+# Enable WAL compression for recovery tests.
+# lz4 is used here; 052_wal_compression.pl separately tests all methods.
+wal_compression = 'lz4'

Doesn't this test need a check requiring a build with compression support?

+#wal_compression = lz4   # enables compression of full-page writes;

But the default value is still off; isn't this example misleading?

+ total_len_decomp = -1; /* XLogCompressionHeader spans pages */

at multiple places, but this is an unsigned variable.

+  variable => 'wal_compression_buffer',
+  boot_val => '295972',
+  min => '295972',

Doesn't guc_params.dat support using macros instead of this hardcoded
magic number?