Force re-compression with lz4
Hello,
I have a table storing mostly text data (40M+ rows) that has pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.
I’ve altered the column to use the new lz4 compression, but that only applies to new rows.
What’s the recommended way of triggering the re-evaluation for pre-existing rows?
I tried wrapping logic like the following in a function, but apparently each old record retains its existing compression.
text_corpus=(SELECT t.text from ...);
delete from t where id=;
insert into t(id, text) values (id, text_corpus);
For the time being, I’ve resorted to an external shell script that executes against the DB, but that’s too slow since it moves data in and out of the database.
Is there a smarter way to do this?
Thanks,
Flo
On 10/17/21 10:12 AM, Florents Tselai wrote:
Hello,
I have a table storing mostly text data (40M+ rows) that has
pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.
I’ve altered the column to use the new lz4 compression, but that only applies to new rows.
What’s the recommended way of triggering the re-evaluation for pre-existing rows?
I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...); delete from t where id=; insert into t(id, text) values (id, text_corpus);
Because it's all in one transaction?
Fttb, I resorted to preparing an external shell script to execute against the db but that’s too slow as it moves data in&out the db.
Is there a smarter way to do this?
Even with in-place compression, you've got to read the uncompressed data.
Does your shell script process one record at a time? Maybe do ranges:
COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
DELETE FROM t WHERE id BETWEEN x AND y;
COPY t FROM '/some/file.csv';
--
Angular momentum makes the world go 'round.
On 10/17/21 11:36 AM, Ron wrote:
On 10/17/21 10:12 AM, Florents Tselai wrote:
Hello,
I have a table storing mostly text data (40M+ rows) that has
pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.
I’ve altered the column to use the new lz4 compression, but that only applies to new rows.
What’s the recommended way of triggering the re-evaluation for pre-existing rows?
I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...); delete from t where id=; insert into t(id, text) values (id, text_corpus);
Because it's all in one transaction?
Fttb, I resorted to preparing an external shell script to execute against the db but that’s too slow as it moves data in&out the db.
Is there a smarter way to do this?
Even with in-place compression, you've got to read the uncompressed data.
Does your shell script process one record at a time? Maybe do ranges:
COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
DELETE FROM t WHERE id BETWEEN x AND y;
I forgot to mention:
VACUUM t;
COPY t FROM '/some/file.csv';
--
Angular momentum makes the world go 'round.
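Putting the two messages above together, one per-batch sequence might look like this (a sketch only: the table name, key bounds and file path are placeholders; COPY defaults to text format unless FORMAT csv is given, and VACUUM cannot run inside a transaction block, so each statement runs on its own with the dump file as the safety net):

COPY (SELECT * FROM t WHERE id BETWEEN x AND y)
  TO '/some/file.csv' WITH (FORMAT csv);
DELETE FROM t WHERE id BETWEEN x AND y;
VACUUM t;   -- lets the space freed by the DELETE be reused by the reload
COPY t FROM '/some/file.csv' WITH (FORMAT csv);

The reloaded rows are written as new values, so they pick up whatever compression method the column is currently set to.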
On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai <florents.tselai@gmail.com>
wrote:
Hello,
I have a table storing mostly text data (40M+ rows) that has
pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.
I’ve altered the column to use the new lz4 compression, but that only applies to new rows.
What’s the recommended way of triggering the re-evaluation for pre-existing rows?
I tried wrapping a function like the following, but apparently each old record retains the compression applied.
text_corpus=(SELECT t.text from ...); delete from t where id=; insert into t(id, text) values (id, text_corpus);
Fttb, I resorted to preparing an external shell script to execute against the db but that’s too slow as it moves data in&out the db.
Is there a smarter way to do this?
It should be enough to VACUUM FULL the table. (but it has to be VACUUM
FULL, not a regular vacuum). Or CLUSTER.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 10/17/21 10:17, Magnus Hagander wrote:
On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai
<florents.tselai@gmail.com> wrote:
Is there a smarter way to do this ?
It should be enough to VACUUM FULL the table. (but it has to be VACUUM
FULL, not a regular vacuum). Or CLUSTER.
With the proviso that this will require double the existing space,
~670GB, until the operation is completed.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
--
Adrian Klaver
adrian.klaver@aklaver.com
Yes, that COPY-delete-COPY sequence is what I ended up doing.
Unfortunately I can’t use ranges, as the PK is a text string.
I did look into VACUUM (FULL) for its PROCESS_TOAST option, which makes sense, but the thing is I already had a cron-ed VACUUM (FULL) that I ended up disabling a while back, precisely because of the double-space requirement.
The DB is already over 1TB in size, and temporarily occupying another ~600GB would require some hassle. Thus, the external script approach makes more sense.
Florents Tselai wrote:
I have a table storing mostly text data (40M+ rows) that has
pg_total_relation_size ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4
compression.
You could start experimenting with data samples rather than the
full contents.
FWIW, in my case I've found that the compression ratio of lz4 was only
marginally better than pglz (like 2% on text).
As for decompression time, it doesn't seem to differ significantly
from pglz, so overall, recompressing existing data did not seem
worth the trouble.
However lz4 appears to be much faster to compress than pglz, so its
benefit is clear in terms of CPU usage for future insertions.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
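A quick way to run that kind of sample experiment (a sketch only; the table and column names follow the pseudocode earlier in the thread, and the sample-table names are made up):

CREATE TABLE sample_pglz (body text COMPRESSION pglz);
CREATE TABLE sample_lz4  (body text COMPRESSION lz4);

INSERT INTO sample_pglz SELECT text FROM t TABLESAMPLE SYSTEM (1);
INSERT INTO sample_lz4  SELECT body FROM sample_pglz;

SELECT pg_total_relation_size('sample_pglz') AS pglz_bytes,
       pg_total_relation_size('sample_lz4')  AS lz4_bytes;

Since pg_total_relation_size includes the TOAST data, comparing the two numbers gives a rough idea of the ratio difference before touching the 670GB table.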
On Sun, Oct 17, 2021 at 10:33:52PM +0200, Daniel Verite wrote:
However lz4 appears to be much faster to compress than pglz, so its
benefit is clear in terms of CPU usage for future insertions.
CPU-wise, LZ4 is *much* faster than pglz when it comes to
compression or decompression with its default options. The
compression ratio is comparable between the two, though on average LZ4
compresses a bit less than pglz.
--
Michael
On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
I did look into VACUUM (FULL) for its PROCESS_TOAST option, which
makes sense, but the thing is I already had a cron-ed VACUUM (full)
which I ended up disabling a while back; exactly because of the
double-space requirement.
Please note that VACUUM FULL does not enforce a recompression on
existing values. See commit dbab0c0, which disabled this choice as it
introduced a noticeable performance penalty in some cases when
looking at the compression type of the vacuumed table attributes:
=# CREATE TABLE cmdata(f1 text COMPRESSION pglz);
CREATE TABLE
=# INSERT INTO cmdata VALUES(repeat('1234567890', 1000));
INSERT 0 1
=# SELECT pg_column_compression(f1) FROM cmdata;
pg_column_compression
-----------------------
pglz
(1 row)
=# ALTER TABLE cmdata ALTER COLUMN f1 SET COMPRESSION lz4;
ALTER TABLE
=# VACUUM FULL cmdata;
VACUUM
=# SELECT pg_column_compression(f1) FROM cmdata;
pg_column_compression
-----------------------
pglz
(1 row)
--
Michael
On Sun, 17 Oct 2021 at 21:04, Florents Tselai <florents.tselai@gmail.com> wrote:
Yes, That COPY-delete-COPY sequence is what I ended up doing.
Unfortunately can’t use ranges as the PK is a text string.
Unless you have a really weird PK and have trouble calculating bounds,
text strings are sortable and fine to use as ranges. If you use
half-open intervals you do not even need to query them all (i.e., do a
batch for key < 'aa', then one for >= 'aa' and < 'ab', repeat, and terminate
with a >= 'zz' batch; substitute letter pairs for whatever you think will
partition your keys into adequate chunks). You can find adequate
bounds by scanning the PK index and skipping; just remember to use
half-open intervals and cover the whole key domain.
Francisco Olarte.
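For example, one way to derive such bounds (an assumption, not something given in the thread) is to take every Nth key from the PK and then walk consecutive pairs as half-open intervals, plus an initial batch below the first bound and a final batch from the last bound onward:

SELECT id
FROM (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM t) AS numbered
WHERE rn % 500000 = 0
ORDER BY id;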
Oh, that’s good to know then. So besides ALTER COMPRESSION for future inserts, there’s not much one can do for pre-existing values.
I think it makes sense to update/add more info to the docs on this as well, since other people in the thread expected this to work that way too.
Maybe at some point, even allow an explicit option to be defined during VACUUM?
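For what it's worth, an untested in-place alternative (a sketch only, not something confirmed in this thread) is to force old values to be rewritten with a batched UPDATE, since rewritten values are stored with the column's current compression setting; a plain SET text = text may simply reuse the existing TOAST data, so the concatenation below is there to force a detoast. The table, column and bounds follow the pseudocode earlier in the thread, and regular VACUUMs between batches are still needed to keep the bloat from dead row versions in check:

UPDATE t
SET text = text || ''    -- force a detoast so the value is re-stored
WHERE id >= 'aa' AND id < 'ab'
  AND pg_column_compression(text) = 'pglz';

VACUUM t;

-- verify on a small batch first, and check progress
SELECT pg_column_compression(text) AS method, count(*)
FROM t
GROUP BY 1;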
On Mon, Oct 18, 2021 at 09:57:11AM +0300, Florents Tselai wrote:
Oh, that’s good to know then. So besides ALTER COMPRESSION for
future inserts there’s not much one can do for pre-existing values
The posting style of the mailing list is to not top-post, so if you
could avoid breaking the logic of the thread, that would be nice :)
I think it makes sense to update/ add more info to the docs on this
as well, since other people in the thread expected this to work that
way too.
There is some documentation, as changing the compression for an
existing table is part of ALTER TABLE:
https://www.postgresql.org/docs/current/sql-altertable.html
"This does not cause the table to be rewritten, so existing data may
still be compressed with other compression methods. If the table is
restored with pg_restore, then all values are rewritten with the
configured compression method."
Maybe at some point, even allow an explicit option to be defined during VACUUM ?
That's a part where we disagreed, as it should not be VACUUM's work to
do that. The option would have a limited impact, as it concerns users
doing a one-time operation, most likely as part of an upgrade, so
I don't think it would be worth having anyway.
--
Michael
On Mon, Oct 18, 2021 at 7:18 AM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
I did look into VACUUM (FULL) for its PROCESS_TOAST option, which
makes sense, but the thing is I already had a cron-ed VACUUM (full)
which I ended up disabling a while back; exactly because of the
double-space requirement.
Please note that VACUUM FULL does not enforce a recompression on
existing values. See commit dbab0c0, which disabled this choice as it
introduced a noticeable performance penalty in some cases when
looking at the compression type of the vacuumed table attributes:
Oh dang, I missed that this was reverted. Thanks for pointing that out!
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On 10/18/21 01:07, Michael Paquier wrote:
CPU-wise, LZ4 is *much* faster than pglz when it comes to
compression or decompression with its default options. The
compression ratio is comparable between the two, though on average LZ4
compresses a bit less than pglz.
--
Michael
LZ4 works much better with deduplication tools like Data Domain or Data
Domain Boost (client side deduplication). With zip or gzip compression,
deduplication ratios are much lower than with LZ4. Most of the modern
backup tools (DD, Veeam, Rubrik, Commvault) support deduplication. LZ4
algorithm uses less CPU than zip, gzip or bzip2 and works much better
with deduplication algorithms employed by the backup tools. This is
actually a very big and positive change.
Disclosure:
I used to work for Commvault as a senior PS engineer. Commvault was the
first tool on the market to combine LZ4 and deduplication.
Regards
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
On 10/18/21 06:41, Mladen Gogala wrote:
On 10/18/21 01:07, Michael Paquier wrote:
CPU-wise, LZ4 is *much* faster than pglz when it comes to
compression or decompression with its default options. The
compression ratio is comparable between the two, though on average LZ4
compresses a bit less than pglz.
--
Michael
LZ4 works much better with deduplication tools like Data Domain or Data
Domain Boost (client side deduplication). With zip or gzip compression,
deduplication ratios are much lower than with LZ4. Most of the modern
backup tools (DD, Veeam, Rubrik, Commvault) support deduplication. LZ4
algorithm uses less CPU than zip, gzip or bzip2 and works much better
with deduplication algorithms employed by the backup tools. This is
actually a very big and positive change.
Not sure how much this applies to the Postgres usage of lz4. As I
understand it, this is only used internally for table compression. When
using pg_dump, compression is gzip, unless you pipe plain-text
output through some other program.
Disclosure:
I used to work for Commvault as a senior PS engineer. Commvault was the
first tool on the market to combine LZ4 and deduplication.
Regards
--
Adrian Klaver
adrian.klaver@aklaver.com
On Mon, Oct 18, 2021 at 08:01:04AM -0700, Adrian Klaver wrote:
Not sure how much this applies to the Postgres usage of lz4. As I understand
it, this is only used internally for table compression. When using pg_dump
compression gzip is used. Unless you pipe plain text output through some
other program.
More precisely, LZ4 applies to the compression of toastable values in
14~. In 15~, we can already use it for WAL and the compression of
full-page writes.
It is worth noting that there are extra patches floating around to add
more LZ4 pluggability to pg_dump (I think this has not been published
yet), pg_receivewal (published) and base backups through the
replication protocol (published). I have seen rather good numbers
when it came to WAL, FWIW. Even if the compression ratio was a bit
less than pglz, it was much faster.
--
Michael
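As a concrete illustration of the WAL piece (PostgreSQL 15 and later only, and the server has to be built with --with-lz4), full-page-write compression can be switched over without a restart:

ALTER SYSTEM SET wal_compression = 'lz4';
SELECT pg_reload_conf();
SHOW wal_compression;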
On 10/18/21 11:01, Adrian Klaver wrote:
Not sure how much this applies to the Postgres usage of lz4. As I
understand it, this is only used internally for table compression.
When using pg_dump compression gzip is used. Unless you pipe plain
text output through some other program.
This applies when using a 3rd party commercial backup tool with
deduplication. You'd be surprised how many people do that. One tool to
backup them all, one tool to find them and on the LTO cartridge backup
them. I apologize for this cheesy paraphrase of Tolkien, but I couldn't
resist. Long story short, for the 3rd party backup tools LZ4
compression will yield better deduplication ratios than other forms of
compression, thereby saving you space.
--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com