Proposal: Adding compression of temporary files

Started by Filip Janus · 3 months ago · 32 messages · hackers
#1 Filip Janus
fjanus@redhat.com

Hi all,
PostgreSQL supports data compression nowadays, but compression of
temporary files has not been implemented yet. Huge queries can produce a
significant amount of temporary data that needs to be stored on disk,
causing many expensive I/O operations.
I am attaching a proposed patch that enables temporary file compression,
for hash joins only for now. Initially I chose the LZ4 compression
algorithm; it would probably have made better sense to start with pglz,
but I realized that too late.

# Future possible improvements
Reduce the number of memory allocations when dumping and loading the
buffer. I see two ways to solve this: either add a buffer to struct
BufFile, or have the caller provide the buffer as an argument. For
sequential execution, I would prefer the second option.
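
To illustrate the second option, here is a rough sketch (not the patch's
actual code - the scratch struct and the helper name are made up, while
BufFileWrite() and LZ4_compress_default() are existing APIs) of a dump path
that reuses a caller-owned buffer instead of allocating and freeing one per
block:

#include "postgres.h"
#include "storage/buffile.h"
#include <lz4.h>

/* Hypothetical caller-owned scratch space, allocated once per hash join. */
typedef struct CompressScratch
{
    char   *data;               /* at least LZ4_compressBound(BLCKSZ) bytes */
    int     size;
} CompressScratch;

/* Sketch: compress one raw block and append it, prefixed with its length. */
static void
BufFileDumpCompressedBlock(BufFile *file, CompressScratch *scratch,
                           const char *raw, int rawlen)
{
    int         clen = LZ4_compress_default(raw, scratch->data, rawlen,
                                            scratch->size);

    if (clen <= 0)
        elog(ERROR, "LZ4 compression failed");

    BufFileWrite(file, &clen, sizeof(clen));
    BufFileWrite(file, scratch->data, clen);
}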

# Future plan/open questions
In the future, I would like to add support for pglz and zstd. Further, I
plan to extend temporary file compression to other consumers such as
sorting, GiST index creation, etc.

Another open item is experimenting with the stream mode of the compression
algorithms. The compression ratio of LZ4 in block mode seems satisfactory,
but stream mode could produce a better ratio at the cost of more memory,
because the context for LZ4 stream compression has to be kept around.
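
For context, the difference between the two LZ4 modes looks roughly like
this (standard liblz4 calls; only the wrapper names are mine):

#include <lz4.h>

/* Block mode: each chunk is compressed independently, no state is kept. */
static int
compress_block(const char *src, int srclen, char *dst, int dstcap)
{
    return LZ4_compress_default(src, dst, srclen, dstcap);
}

/*
 * Stream mode: consecutive chunks share a dictionary, which can improve the
 * compression ratio, but the LZ4_stream_t context (and the recently
 * compressed input it references) must stay in memory for the whole file.
 */
static int
compress_block_streamed(LZ4_stream_t *ctx, const char *src, int srclen,
                        char *dst, int dstcap)
{
    return LZ4_compress_fast_continue(ctx, src, dst, srclen, dstcap, 1);
}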

# Benchmark
I prepared three different databases to verify my expectations. Each
dataset is described below. My testing demonstrates that the patch
improves the execution time of huge hash joins and should not negatively
affect the performance of smaller queries. The amount of temporary file
data was reduced in every run, without a significant impact on execution
time.

## Dataset A:
Tables:
table_a(bigint id, text data_text, integer data_number) - 10000000 rows
table_b(bigint id, integer ref_id, numeric data_value, bytea data_blob) - 10000000 rows
Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;

The tables contain highly compressible data. The query demonstrated a
reduction in temporary file usage from ~20GB to ~3GB, which in turn
reduced the execution time of the query by about 10s.
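
For illustration, highly compressible rows like these can be generated
along the following lines (a sketch only, not necessarily how the benchmark
data was built):

-- illustrative only: repeated text makes the table compress very well
INSERT INTO table_a (id, data_text, data_number)
SELECT g, repeat('compressible text ', 50), (g % 1000)::integer
FROM generate_series(1, 10000000) AS g;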

## Dataset B:
Tables:
table_a(integer id, text data_blob) - 1110000 rows
table_b(integer id, text data_blob) - 10000000 rows
Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;

The tables contain less compressible data; data_blob was generated by a
pseudo-random generator. In this case, the data reduction was only ~50%,
and the execution time changed only slightly with compression enabled.

The second scenario (work_mem raised enough to avoid temp file usage
entirely) demonstrates that enabling compression adds no overhead when
nothing actually spills to disk.
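
A sketch of how the poorly compressible data_blob values above could be
generated (again illustrative only):

-- md5() of random input yields hex strings with little redundancy
INSERT INTO table_b (id, data_blob)
SELECT g, md5(random()::text) || md5(random()::text)
FROM generate_series(1, 10000000) AS g;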

## Dataset C:
Tables:
customers(integer, text, text, text, text)
order_items(integer, integer, integer, integer, numeric(10,2))
orders(integer, integer, timestamp, numeric(10,2))
products(integer, text, text, numeric(10,2), integer)

Query:
SELECT p.product_id, p.name, p.price,
       SUM(oi.quantity) AS total_quantity,
       AVG(oi.price) AS avg_item_price
FROM eshop.products p
JOIN eshop.order_items oi ON p.product_id = oi.product_id
JOIN eshop.orders o ON oi.order_id = o.order_id
WHERE o.order_date > '2020-01-01' AND p.price > 50
GROUP BY p.product_id, p.name, p.price
HAVING SUM(oi.quantity) > 1000
ORDER BY total_quantity DESC
LIMIT 100;

This scenario should demonstrate a more realistic usage of the database.
Enabled compression slightly reduced the temporary file usage, but the
execution time wasn't affected by compression.

+---------+-------------+------------+----------------+----------+
| Dataset | Compression | temp_bytes | Execution time | work_mem |
+---------+-------------+------------+----------------+----------+
| A       | Yes         | 3.09 GiB   | 22s 586ms      | 4MB      |
| A       | No          | 21.89 GiB  | 35s            | 4MB      |
+---------+-------------+------------+----------------+----------+
| B       | Yes         | 333 MB     | 1815.545 ms    | 4MB      |
| B       | No          | 146 MB     | 1500.460 ms    | 4MB      |
| B       | Yes         | 0 MB       | 3262.305 ms    | 80MB     |
| B       | No          | 0 MB       | 3174.725 ms    | 80MB     |
+---------+-------------+------------+----------------+----------+
| C       | Yes         | 40 MB      | 1011.020 ms    | 1MB      |
| C       | No          | 53 MB      | 1034.142 ms    | 1MB      |
+---------+-------------+------------+----------------+----------+

Regards,

-Filip-

Attachments:

0001-This-commit-adds-support-for-temporary-files-compres.patch (+200, -16)
#2 Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#1)
Re: Proposal: Adding compression of temporary files

This version fixes a compiler warning caused by an uninitialized local variable.

-Filip-

čt 14. 11. 2024 v 23:13 odesílatel Filip Janus <fjanus@redhat.com> napsal:

Show quoted text


Attachments:

0001-This-commit-adds-support-for-temporary-files-compres-v2.patch (+200, -16)
#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Filip Janus (#2)
Re: Proposal: Adding compression of temporary files

Hi,

On 11/18/24 22:58, Filip Janus wrote:

...
Hi all,
Postgresql supports data compression nowadays, but the compression of
temporary files has not been implemented yet. The huge queries can 
produce a significant amount of temporary data that needs to
be stored on disk 
and cause many expensive I/O operations.
I am attaching a proposal of the patch to enable temporary files
compression for
hashjoins for now. Initially, I've chosen the LZ4 compression
algorithm. It would
probably make better sense to start with pglz, but I realized it late.

Thanks for the idea & patch. I agree this might be quite useful for
workloads generating a lot of temporary files for stuff like sorts etc.
I think it will be interesting to think about the trade-offs, i.e. how
to pick the compression level - at some point the compression ratio
stops improving while paying more and more CPU time. Not sure what the
right choice is, so using default seems fine.

I agree it'd be better to start with pglz, and only then add lz4 etc.
Firstly, pglz is simply the built-in compression, supported everywhere.
And it's also simpler to implement, I think.

# Future possible improvements
Reducing the number of memory allocations within the dumping and
loading of
the buffer. I have two ideas for solving this problem. I would
either add a buffer into
struct BufFile or provide the buffer as an argument from the caller.
For the sequential 
execution, I would prefer the second option.

Yes, this would be good. Doing a palloc+pfree for each compression is
going to be expensive, especially because these buffers are going to be
large - likely larger than 8kB. Which means it's not cached in the
memory context, etc.

Adding it to the BufFile is not going to fly, because that doubles the
amount of memory per file. And we already have major issues with hash
joins consuming massive amounts of memory. But at the same time the
buffer is only needed during compression, and there's only one at a
time. So I agree with passing a single buffer as an argument.

# Future plan/open questions
In the future, I would like to add support for pglz and zstd.
Further, I plan to
extend the support of the temporary file compression also for
sorting, gist index creation, etc.

Experimenting with the stream mode of compression algorithms. The
compression 
ratio of LZ4 in block mode seems to be satisfying, but the stream
mode could 
produce a better ratio, but it would consume more memory due to the
requirement to store
context for LZ4 stream compression.

One thing I realized is that this only enables temp file compression for
a single place - hash join spill files. AFAIK this is because compressed
files don't support random access, and the other places might need that.

Is that correct? The patch does not explain this anywhere. If that's
correct, the patch probably should mention this in a comment for the
'compress' argument added to BufFileCreateTemp(), so that it's clear
when it's legal to set compress=true.

Which other places might compress temp files? Surely hash joins are not
the only place that could benefit from this, right?

Another thing is testing. If I run regression tests, it won't use
compression at all, because the GUC has "none" by default, right? But we
need some testing, so how would we do that? One option would be to add a
regression test that explicitly sets the GUC and does a hash join, but
that won't work with lz4 (because that may not be enabled).

Another option might be to add a PG_TEST_xxx environment variable that
determines compression to use. Something like PG_TEST_USE_UNIX_SOCKETS.
But perhaps there's a simpler way.
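
For instance, a test along these lines would exercise the spill path -
'temp_file_compression' here is just a placeholder for whatever the GUC
ends up being called:

SET temp_file_compression = 'lz4';  -- placeholder GUC name
SET work_mem = '64kB';              -- small enough to force the join to spill
SELECT count(*) FROM tenk1 t1 JOIN tenk1 t2 USING (hundred);
RESET work_mem;
RESET temp_file_compression;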

# Benchmark
[... dataset descriptions and results table quoted from the original message ...]

Thanks. I'll try to do some benchmarks on my own.

Are these results from a single run, or an average of multiple runs? Do
you maybe have a script to reproduce this, including the data generation?

Also, can you share some information about the machine used for this? I
expect the impact to strongly depend on memory pressure - if the temp
file fits into the page cache (and stays there), it may not benefit from
compression, right?

regards

--
Tomas Vondra

#4 Filip Janus
fjanus@redhat.com
In reply to: Tomas Vondra (#3)
Re: Proposal: Adding compression of temporary files

-Filip-

st 20. 11. 2024 v 1:35 odesílatel Tomas Vondra <tomas@vondra.me> napsal:

Hi,

On 11/18/24 22:58, Filip Janus wrote:

...
Hi all,
Postgresql supports data compression nowadays, but the compression of
temporary files has not been implemented yet. The huge queries can
produce a significant amount of temporary data that needs to
be stored on disk
and cause many expensive I/O operations.
I am attaching a proposal of the patch to enable temporary files
compression for
hashjoins for now. Initially, I've chosen the LZ4 compression
algorithm. It would
probably make better sense to start with pglz, but I realized it late.

Thanks for the idea & patch. I agree this might be quite useful for
workloads generating a lot of temporary files for stuff like sorts etc.
I think it will be interesting to think about the trade offs, i.e. how
to pick the compression level - at some point the compression ratio
stops improving while paying more and more CPU time. Not sure what the
right choice is, so using default seems fine.

I agree it'd be better to start with pglz, and only then add lz4 etc.
Firstly, pglz is simply the built-in compression, supported everywhere.
And it's also simpler to implement, I think.

# Future possible improvements
Reducing the number of memory allocations within the dumping and
loading of
the buffer. I have two ideas for solving this problem. I would
either add a buffer into
struct BufFile or provide the buffer as an argument from the caller.
For the sequential
execution, I would prefer the second option.

Yes, this would be good. Doing a palloc+pfree for each compression is
going to be expensive, especially because these buffers are going to be
large - likely larger than 8kB. Which means it's not cached in the
memory context, etc.

Adding it to the BufFile is not going to fly, because that doubles the
amount of memory per file. And we already have major issues with hash
joins consuming massive amounts of memory. But at the same time the
buffer is only needed during compression, and there's only one at a
time. So I agree with passing a single buffer as an argument.

# Future plan/open questions
In the future, I would like to add support for pglz and zstd.
Further, I plan to
extend the support of the temporary file compression also for
sorting, gist index creation, etc.

Experimenting with the stream mode of compression algorithms. The
compression
ratio of LZ4 in block mode seems to be satisfying, but the stream
mode could
produce a better ratio, but it would consume more memory due to the
requirement to store
context for LZ4 stream compression.

One thing I realized is that this only enables temp file compression for
a single place - hash join spill files. AFAIK this is because compressed
files don't support random access, and the other places might need that.

Is that correct? The patch does not explain this anywhere. If that's
correct, the patch probably should mention this in a comment for the
'compress' argument added to BufFileCreateTemp(), so that it's clear
when it's legal to set compress=true.

I will add the description there.

Which other places might compress temp files? Surely hash joins are not
the only place that could benefit from this, right?

Yes, you are definitely right. I chose hash joins as a POC because there
are no seeks apart from the seek to the beginning of the buffer. I have
focused on hash joins, but there are definitely other places where the
compression could be used, and I want to add support for them in the
future.

Another thing is testing. If I run regression tests, it won't use
compression at all, because the GUC has "none" by default, right? But we
need some testing, so how would we do that? One option would be to add a
regression test that explicitly sets the GUC and does a hash join, but
that won't work with lz4 (because that may not be enabled).

Right, it's "none" by default. My opinion is that we should test every
supported compression method, so I will try to add an environment variable
as you recommended.

Another option might be to add a PG_TEST_xxx environment variable that
determines compression to use. Something like PG_TEST_USE_UNIX_SOCKETS.
But perhaps there's a simpler way.

# Benchmark
[... dataset descriptions and results table quoted from the original message ...]

Thanks. I'll try to do some benchmarks on my own.

Are these results from a single run, or an average of multiple runs?

It is an average of multiple runs.

Do you maybe have a script to reproduce this, including the data generation?

I am attaching my SQL file for database preparation. I also did further
testing on two other machines (see the attached huge_table.rtf).

Also, can you share some information about the machine used for this? I
expect the impact to strongly depend on memory pressure - if the temp
file fits into the page cache (and stays there), it may not benefit from
compression, right?

If it fits into the page cache thanks to compression, I would consider
that a benefit of compression.
I performed further testing on machines with different memory sizes.
Both experiments showed that compression was beneficial for execution
time; the reduction was more significant on the machine with less memory
available.

Tests were performed on:
MacBook Pro M3, 36GB - macOS
ARM64 virtual machine, 10GB / 6 CPU - Fedora 39

Show quoted text

regards

--
Tomas Vondra

Attachments:

huge_table.rtf (text/rtf)
lz4.sql
#5 Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#4)
Re: Proposal: Adding compression of temporary files

I've added a regression test for LZ4 compression; it runs if the server is
compiled with the --with-lz4 option.

-Filip-

ne 24. 11. 2024 v 15:53 odesílatel Filip Janus <fjanus@redhat.com> napsal:

Show quoted text


Attachments:

0001-This-commit-adds-support-for-temporary-files-compres-v3.patch (+2005, -16)
#6 Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#5)
Re: Proposal: Adding compression of temporary files

Even though I started with LZ4, I have also added pglz support and enhanced
the memory management based on the review.

-Filip-

čt 28. 11. 2024 v 12:32 odesílatel Filip Janus <fjanus@redhat.com> napsal:

Show quoted text


Attachments:

0001-This-commit-adds-support-for-temporary-files-compres.patch (+2006, -17)
0002-This-commit-enhance-temporary-file-compression.patch (+78, -33)
0003-Add-test-for-pglz-compression-of-temporary-files.patch (+1795, -2)
#7 Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#6)
Re: Proposal: Adding compression of temporary files

I apologize for multiple messages, but I found a small bug in the previous
version.

-Filip-

so 4. 1. 2025 v 23:40 odesílatel Filip Janus <fjanus@redhat.com> napsal:

Show quoted text


Attachments:

0001-This-commit-adds-support-for-temporary-files-compres.patch (+2006/-17)
0002-This-commit-enhance-temporary-file-compression.patch (+80/-39)
0003-Add-test-for-pglz-compression-of-temporary-files.patch (+1795/-2)
#8Alexander Korotkov
aekorotkov@gmail.com
In reply to: Filip Janus (#7)
Re: Proposal: Adding compression of temporary files

On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:

I apologize for multiple messages, but I found a small bug in the previous version.

-Filip-

Great, thank you for your work.

I think the patches could use a pgindent run.

I don't see a reason why the temp file compression method should be
different from the wal compression methods, which we already have
in-tree. Perhaps it would be nice to have a 0001 patch, which would
abstract the compression methods we now have for wal into a separate
file containing GUC option values and functions for
compress/decompress. Then, 0002 would apply this to temporary file
compression.

------
Regards,
Alexander Korotkov
Supabase

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexander Korotkov (#8)
Re: Proposal: Adding compression of temporary files

On 3/15/25 11:40, Alexander Korotkov wrote:

On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:

I apologize for multiple messages, but I found a small bug in the previous version.

-Filip-

Great, thank you for your work.

I think the patches could use a pgindent run.

I don't see a reason why the temp file compression method should be
different from the wal compression methods, which we already have
in-tree. Perhaps it would be nice to have a 0001 patch, which would
abstract the compression methods we now have for wal into a separate
file containing GUC option values and functions for
compress/decompress. Then, 0002 would apply this to temporary file
compression.

Not sure I understand the design you're proposing ...

AFAIK the WAL compression is not compressing the file data directly,
it's compressing backup blocks one by one, which then get written to WAL
as one piece of a record. So it's dealing with individual blocks, not
files, and we already have API to compress blocks (well, it's pretty
much the APIs for each compression method).

You're proposing abstracting that into a separate file - what would be
in that file? How would you abstract this to make it also useful for
file compression?

I can imagine a function CompressBufffer(method, dst, src, ...) wrapping
the various compression methods, unifying the error handling, etc. I can
imagine that, but that API is also limiting - e.g. how would that work
with stream compression, which seems irrelevant for WAL, but might be
very useful for tempfile compression.

IIRC this is mostly why we didn't try to do such a generic API for pg_dump
compression, there's a local pg_dump-specific abstraction.

FWIW looking at the patch, I still don't quite understand why it needs
to correct the offset like this:

+    if (!file->compress)
+        file->curOffset -= (file->nbytes - file->pos);
+    else
+        if (nbytesOriginal - file->pos != 0)
+            /* curOffset must be corrected also if compression is
+             * enabled, nbytes was changed by compression but we
+             * have to use the original value of nbytes
+             */
+            file->curOffset-=bytestowrite;

It's not something introduced by the compression patch - the first part
is what we used to do before. But I find it a bit confusing - isn't it
mixing the correction of "logical file position" adjustment we did
before, and also the adjustment possibly needed due to compression?

In fact, isn't it going to fail if the code gets multiple loops in

while (wpos < file->nbytes)
{
...
}

because bytestowrite will be the value from the last loop? I haven't
tried, but I guess writing wide tuples (more than 8k) might fail.

regards

--
Tomas Vondra

#10Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tomas Vondra (#9)
Re: Proposal: Adding compression of temporary files

On Tue, Mar 18, 2025 at 12:13 AM Tomas Vondra <tomas@vondra.me> wrote:

On 3/15/25 11:40, Alexander Korotkov wrote:

On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:

I apologize for multiple messages, but I found a small bug in the previous version.

-Filip-

Great, thank you for your work.

I think the patches could use a pgindent run.

I don't see a reason why the temp file compression method should be
different from the wal compression methods, which we already have
in-tree. Perhaps it would be nice to have a 0001 patch, which would
abstract the compression methods we now have for wal into a separate
file containing GUC option values and functions for
compress/decompress. Then, 0002 would apply this to temporary file
compression.

Not sure I understand the design you're proposing ...

AFAIK the WAL compression is not compressing the file data directly,
it's compressing backup blocks one by one, which then get written to WAL
as one piece of a record. So it's dealing with individual blocks, not
files, and we already have API to compress blocks (well, it's pretty
much the APIs for each compression method).

You're proposing abstracting that into a separate file - what would be
in that file? How would you abstract this to make it also useful for
file compression?

I can imagine a function CompressBufffer(method, dst, src, ...) wrapping
the various compression methods, unifying the error handling, etc. I can
imagine that, but that API is also limiting - e.g. how would that work
with stream compression, which seems irrelevant for WAL, but might be
very useful for tempfile compression.

Yes, I was thinking about some generic API that provides a safe way to
compress some data chunk with a given compression method. It seems that
it should suit WAL, toast, and temp files in the current
implementation. But, yes, if we implement streaming compression
in the future, that would require another API.

------
Regards,
Alexander Korotkov
Supabase

#11Filip Janus
fjanus@redhat.com
In reply to: Tomas Vondra (#9)
Re: Proposal: Adding compression of temporary files

-Filip-

On Mon, Mar 17, 2025 at 11:13 PM, Tomas Vondra <tomas@vondra.me> wrote:

On 3/15/25 11:40, Alexander Korotkov wrote:

On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:

I apologize for multiple messages, but I found a small bug in the

previous version.

-Filip-

Great, thank you for your work.

I think the patches could use a pgindent run.

I don't see a reason why the temp file compression method should be
different from the wal compression methods, which we already have
in-tree. Perhaps it would be nice to have a 0001 patch, which would
abstract the compression methods we now have for wal into a separate
file containing GUC option values and functions for
compress/decompress. Then, 0002 would apply this to temporary file
compression.

Not sure I understand the design you're proposing ...

AFAIK the WAL compression is not compressing the file data directly,
it's compressing backup blocks one by one, which then get written to WAL
as one piece of a record. So it's dealing with individual blocks, not
files, and we already have API to compress blocks (well, it's pretty
much the APIs for each compression method).

You're proposing abstracting that into a separate file - what would be
in that file? How would you abstract this to make it also useful for
file compression?

I can imagine a function CompressBufffer(method, dst, src, ...) wrapping
the various compression methods, unifying the error handling, etc. I can
imagine that, but that API is also limiting - e.g. how would that work
with stream compression, which seems irrelevant for WAL, but might be
very useful for tempfile compression.

IIRC this is mostly why we didn't try to do such generic API for pg_dump
compression, there's a local pg_dump-specific abstraction.

FWIW looking at the patch, I still don't quite understand why it needs
to correct the offset like this:

+    if (!file->compress)
+        file->curOffset -= (file->nbytes - file->pos);

This line of code is really confusing to me, and I wasn't able to fully
understand why it must be done,
but I experimented with it, and if I remember correctly, it's triggered
(the result differs from 0) mainly in the last call of
BufFileDumpBuffer function for a single data chunk.

+    else
+        if (nbytesOriginal - file->pos != 0)
+            /* curOffset must be corrected also if compression is
+             * enabled, nbytes was changed by compression but we
+             * have to use the original value of nbytes
+             */
+            file->curOffset-=bytestowrite;

It's not something introduced by the compression patch - the first part
is what we used to do before. But I find it a bit confusing - isn't it
mixing the correction of "logical file position" adjustment we did
before, and also the adjustment possibly needed due to compression?

In fact, isn't it going to fail if the code gets multiple loops in

while (wpos < file->nbytes)
{
...
}

because bytestowrite will be the value from the last loop? I haven't
tried, but I guess writing wide tuples (more than 8k) might fail.

I will definitely test it with larger tuples than 8K.

Maybe I don't understand it correctly,
the adjustment is performed in the case that file->nbytes and file->pos
differ.
So it must persist also when we are working with compressed data, but the
problem is that the data stored compressed on disk has a different size than
the incoming uncompressed data, so what should the correction value be?
By debugging, I realized that the correction should correspond to the size
of bytestowrite from the last iteration of the loop.

Show quoted text

regards

--
Tomas Vondra

#12Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Filip Janus (#11)
Re: Proposal: Adding compression of temporary files

On Fri, Mar 28, 2025 at 09:23:13AM GMT, Filip Janus wrote:

+    else
+        if (nbytesOriginal - file->pos != 0)
+            /* curOffset must be corrected also if compression is
+             * enabled, nbytes was changed by compression but we
+             * have to use the original value of nbytes
+             */
+            file->curOffset-=bytestowrite;

It's not something introduced by the compression patch - the first part
is what we used to do before. But I find it a bit confusing - isn't it
mixing the correction of "logical file position" adjustment we did
before, and also the adjustment possibly needed due to compression?

In fact, isn't it going to fail if the code gets multiple loops in

while (wpos < file->nbytes)
{
...
}

because bytestowrite will be the value from the last loop? I haven't
tried, but I guess writing wide tuples (more than 8k) might fail.

I will definitely test it with larger tuples than 8K.

Maybe I don't understand it correctly,
the adjustment is performed in the case that file->nbytes and file->pos
differ.
So it must persist also if we are working with the compressed data, but the
problem is that data stored and compressed on disk has different sizes than
data incoming uncompressed ones, so what should be the correction value.
By debugging, I realized that the correction should correspond to the size
of
bytestowrite from the last iteration of the loop.

I agree, this looks strange. If the idea is to set curOffset to its
original value + pos, and the original value was advanced multiple times
by bytestowrite, it seems incorrect to adjust it by bytestowrite only
once. From what I see, the current tests do not exercise a case where the
while loop gets multiple iterations, so it looks fine.

At the same time maybe I'm missing something, but how exactly should such
a test for 8k tuples and multiple loops in the while block look?
E.g. when I force a hash join on a table with a single wide text column,
the minimal tuple that is getting written to the temporary file still
has a rather small length, I assume due to toasting. Is there some other
way to achieve that?
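
One approach that might work (an untested sketch, not something from this
thread): compute the wide value in the query itself, so it never goes
through TOAST and has to be written to the batch files inline:

    -- sketch: multi-batch hash join whose minimal tuples exceed 8kB
    set work_mem = '64kB';
    set enable_mergejoin = off;
    set enable_nestloop = off;
    select count(*)
      from (select i, repeat(md5(i::text), 300) as wide   -- ~9.6kB per row
              from generate_series(1, 10000) i) a
      join generate_series(1, 10000) b(i) using (i);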

#13Filip Janus
fjanus@redhat.com
In reply to: Dmitry Dolgov (#12)
Re: Proposal: Adding compression of temporary files

Since the patch was prepared months ago, it needs to be rebased.

-Filip-

On Sun, Apr 13, 2025 at 9:53 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

Show quoted text

Attachments:

0002-Add-test-for-temporary-files-compression-this-commit.patch (+1259/-2)
0001-This-commit-adds-support-for-temporary-files-compres.patch (+252/-17)
#14Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#13)
Re: Proposal: Adding compression of temporary files

The latest rebase.

-Filip-

On Tue, Apr 22, 2025 at 9:17 AM, Filip Janus <fjanus@redhat.com> wrote:

Show quoted text

Attachments:

0002-Add-test-for-temporary-files-compression-this-commit.patch (+3591/-2)
0001-This-commit-adds-support-for-temporary-files-compres.patch (+251/-16)
#15Andres Freund
andres@anarazel.de
In reply to: Filip Janus (#14)
Re: Proposal: Adding compression of temporary files

Hi,

On 2025-04-25 23:54:00 +0200, Filip Janus wrote:

The latest rebase.

This often seems to fail during tests:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5382

E.g.
https://api.cirrus-ci.com/v1/artifact/task/4667337632120832/testrun/build-32/testrun/recovery/027_stream_regress/log/regress_log_027_stream_regress

=== dumping /tmp/cirrus-ci-build/build-32/testrun/recovery/027_stream_regress/data/regression.diffs ===
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/join_hash_pglz.out /tmp/cirrus-ci-build/build-32/testrun/recovery/027_stream_regress/data/results/join_hash_pglz.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/join_hash_pglz.out	2025-05-26 05:04:40.686524215 +0000
+++ /tmp/cirrus-ci-build/build-32/testrun/recovery/027_stream_regress/data/results/join_hash_pglz.out	2025-05-26 05:15:00.534907680 +0000
@@ -594,11 +594,8 @@
 select count(*) from join_foo
   left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
   on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
- count 
--------
-     3
-(1 row)
-
+ERROR:  could not read from temporary file: read only 8180 of 1572860 bytes
+CONTEXT:  parallel worker
 select final > 1 as multibatch
   from hash_join_batches(
 $$
@@ -606,11 +603,7 @@
     left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
     on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
 $$);
- multibatch 
-------------
- t
-(1 row)
-
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
 rollback to settings;
 -- single-batch with rescan, parallel-oblivious
 savepoint settings;

Greetings,

Andres

#16Filip Janus
fjanus@redhat.com
In reply to: Andres Freund (#15)
Re: Proposal: Adding compression of temporary files

I rebased the proposal and fixed the problem causing those failures.

-Filip-

On Tue, Jun 17, 2025 at 4:49 PM, Andres Freund <andres@anarazel.de> wrote:

Show quoted text

Attachments:

0002-Add-regression-tests-for-temporary-file-compression.patch (+3591/-10)
0001-Add-transparent-compression-for-temporary-files.patch (+337/-31)
#17Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#16)
Re: Proposal: Adding compression of temporary files

Fix overlooked compiler warnings

-Filip-

On Mon, Aug 18, 2025 at 6:51 PM, Filip Janus <fjanus@redhat.com> wrote:

Show quoted text

Attachments:

0001-Add-transparent-compression-for-temporary-files.patch (+342/-32)
0002-Add-regression-tests-for-temporary-file-compression.patch (+3591/-10)
#18Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#17)
Re: Proposal: Adding compression of temporary files

Rebase after changes introduced in guc_tables.c

-Filip-

On Tue, Aug 19, 2025 at 5:48 PM, Filip Janus <fjanus@redhat.com> wrote:

Show quoted text

Attachments:

0001-Add-transparent-compression-for-temporary-files.patch (+338/-32)
0002-Add-regression-tests-for-temporary-file-compression.patch (+3591/-10)
#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Filip Janus (#18)
Re: Proposal: Adding compression of temporary files

Hello Filip,

Thanks for the updated patch, and for your patience with working on this
patch with (unfortunately) little feedback. I took a look at the patch,
and did some testing. In general, I think it's heading in the right
direction, but there's still a couple of issues and open questions.

Attached is a collection of incremental patches with the proposed
changes. I'll briefly explain the motivation for each patch, but it's
easier to share the complete change as a patch. Feel free to disagree
with the changes, some are a matter of opinion, and/or there might be a
better way to do that. Ultimately it should be squashed to the main
patch, or perhaps a couple larger patches.

v20250930-0001-Add-transparent-compression-for-temporary-.patch

- original patch, as posted on 2025/09/26

v20250930-0002-whitespace.patch

- cleanup of whitespace issues
- This is visible in git-show or when applying using git-am.

v20250930-0003-pgindent.patch

- pgindent run, to fix code style (formatting of if/else branches,
indentation, that kind of stuff)
- good to run pgindent every now and then, for consistency

v20250930-0004-Add-regression-tests-for-temporary-file-co.patch

- original patch, as posted on 2025/09/26

v20250930-0005-remove-unused-BufFile-compress_tempfile.patch

- the compress_tempfile was unused, get rid of it

v20250930-0006-simplify-BufFileCreateTemp-interface.patch

- I think the proposed interface (a "compress" flag in BufFileCreateTemp
and then a separate method BufFileCreateCompressTemp) is not great.

- The "compress" flag is a bit pointless, because even if you set it to
"true" you won't get compressed file. In fact, it's fragile - you'll get
broken BufFile without the buffer.

- The patch gets rid of the "compress" flag (so existing callers of
BufFileCreateTemp remain unchanged). BufFileCreateCompressTemp sets the
flag directly, which it can because it's in the same module.

- An alternative would be to keep the flag, do all the compression setup
in BufFileCreateTemp, and get rid of BufFileCreateCompressTemp. Not sure
which is better.

v20250930-0007-improve-BufFileCreateTemp-BufFileCreateCom.patch

- Just improving comments, to document the new stuff (needs a check).

- There are two new XXX comments, with questions. One asks if the
allocation is an issue in practice - is the static buffer worth it? The
other suggests we add an assert protecting against unsupported seeks.

v20250930-0008-BufFileCreateCompressTemp-cleanup-and-comm.patch

- A small BufFileCreateCompressTemp cleanup (better comments, better
variable names, formatting, extra assert, ... mostly cosmetic stuff).

- But this made me realize the 'static buffer' idea is likely flawed, at
least in the current code. It does pfree() on the current buffer, but how
does it know if there are other files referencing it? Because it then
stashes the buffer in file->buffer. I haven't tried to reproduce the
issue, nor fixed this, but it seems like it might be a problem if two
files happen to use a different compression method.

v20250930-0009-minor-BufFileLoadBuffer-cleanup.patch

- Small BufFileLoadBuffer cleanup, I don't think it's worth it to have
such detailed error messages. So just use "could not read file" and then
whatever libc appends as %m.

v20250930-0010-BufFileLoadBuffer-simpler-FileRead-handlin.patch
v20250930-0011-BufFileLoadBuffer-simpler-FileRead-handlin.patch

- I was somewhat confused by the FileRead handling in BufFileLoadBuffer,
so these two patches try to improve / clarify it.

- I still don't understand the purpose of the "if (nread_orig <= 0)"
branch removed by the second patch.

v20250930-0012-BufFileLoadBuffer-comment-update.patch

- minor comment tweak

v20250930-0013-BufFileLoadBuffer-simplify-skipping-header.patch

- I found it confusing how the code advanced the offset by first adding
to header_advance, and only then adding to curOffset much later. This
gets rid of that, and just advances curOffset right after each read.

v20250930-0014-BufFileDumpBuffer-cleanup-simplification.patch

- Improve the comments in BufFileDumpBuffer, and simplify the code. This
is somewhat subjective, but I think the code is more readable.

- It temporarily removes the handling of -1 for pglz compression. This
was a mistake, and is fixed by a later patch.

v20250930-0015-BufFileLoadBuffer-comment.patch

- XXX for a comment I don't understand.

v20250930-0016-BufFileLoadBuffer-missing-FileRead-error-h.patch

- Points out a FileRead call missing error handling (there's another one
with the same issue).

v20250930-0017-simplify-the-compression-header.patch

- I came to the conclusion that having one "length" field for lz4 and
two (compressed + raw) for pglz makes the code unnecessarily complex,
without gaining much. So this just adds a "header" struct with both
lengths for all compression algorithms. I think it's cleaner/simpler.

v20250930-0018-undo-unncessary-changes-to-Makefile.patch

- Why did the 0001 patch add this? Maybe it's something we should add
separately, not as part of this patch?

v20250930-0019-enable-compression-for-tuplestore.patch

- Enables compression for tuplestores that don't require random access.
This covers e.g. tuplestores produced by SRFs like generate_series, etc.
(see the example below).

- I still wonder what it would take to support random access. What if we
remember the offset of each block? We could write that into an uncompressed
file. That'd be 128kB per 1GB, which seems acceptable. Just a thought.
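
As a trivial illustration (my example, not something from the patch), a
function scan like this materializes its output in a tuplestore, and with a
small work_mem it spills to a temporary file, so it should be covered by the
new compression:

    -- SRF output is materialized in a tuplestore; with a small work_mem
    -- it spills to a temporary file
    SET work_mem = '4MB';
    SELECT count(*) FROM generate_series(1, 10000000) g(i);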

v20250930-0020-remember-compression-method-for-each-file.patch

- The code only tracked bool "compress" flag for each file, and then
determined the algorithm during compression/decompression based on the
GUC variable. But that's incorrect, because the GUC can change during
the file life time. For example, there can be a cursor, anb you can do
SET temp_file_compression between the FETCH calls (see the commit
message for an example).

- So this replaces the flag with storing the actual method.
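
To illustrate that scenario (a sketch only, assuming the temp_file_compression
GUC from this patch series; t1/t2 stand for any tables large enough that the
join spills to temp files):

    BEGIN;
    SET temp_file_compression = 'lz4';
    DECLARE c CURSOR FOR SELECT * FROM t1 JOIN t2 USING (id);
    FETCH 100 FROM c;                      -- batch files written with lz4
    SET temp_file_compression = 'pglz';    -- GUC changes mid-scan
    FETCH 100 FROM c;                      -- must still read the lz4-compressed files
    COMMIT;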

v20250930-0021-LZ4_compress_default-returns-0-on-error.patch

- The LZ4_compress_default returns 0 in case of error. Probably a bug
introduced by one of my earlier patches.

v20250930-0022-try-LZ4_compress_fast.patch

- Experimental patch, trying a faster LZ4 compression.

So that's what I have at the moment. I'm also doing some testing,
measuring the effect of compression both for trivial queries (based on
generate_series) and more complex ones from TPC-H.

I'll post the complete results when I have that, but the results I've
seen so far show that:

- pglz and lz4 end up with about the same compression ratio (in TPC-H
it's often cutting the temp files in about half)

- lz4 is on par with no compression (it's pretty much within noise),
while pglz is much slower (sometimes ~2x slower)

I wonder how gzip/zstandard would perform. My guess would be that gzip
would be faster than pglz, but still slower than lz4. Zstandard is much
closer to lz4. Would it be possible to have some experimental patches
for gzip/zstd, so that we can try? It'd also validate that the code is
prepared for adding more algorithms in the future.

The other thing I was thinking about was the LZ4 stream compression.
There's a comment suggesting it might compress better, and indeed - when
working on pg_dump compression we saw a huge improvement. Again, it would
be great to have an experimental patch for this, so that we can
evaluate it.

regards

--
Tomas Vondra

Attachments:

v20250930-0007-improve-BufFileCreateTemp-BufFileCreateCom.patch (+20/-11)
v20250930-0008-BufFileCreateCompressTemp-cleanup-and-comm.patch (+21/-12)
v20250930-0001-Add-transparent-compression-for-temporary-.patch (+338/-32)
v20250930-0002-whitespace.patch (+10/-11)
v20250930-0003-pgindent.patch (+96/-73)
v20250930-0004-Add-regression-tests-for-temporary-file-co.patch (+3591/-10)
v20250930-0005-remove-unused-BufFile-compress_tempfile.patch (+0/-3)
v20250930-0006-simplify-BufFileCreateTemp-interface.patch (+12/-14)
v20250930-0009-minor-BufFileLoadBuffer-cleanup.patch (+10/-13)
v20250930-0010-BufFileLoadBuffer-simpler-FileRead-handlin.patch (+8/-6)
v20250930-0011-BufFileLoadBuffer-simpler-FileRead-handlin.patch (+6/-8)
v20250930-0012-BufFileLoadBuffer-comment-update.patch (+1/-2)
v20250930-0013-BufFileLoadBuffer-simplify-skipping-header.patch (+16/-18)
v20250930-0014-BufFileDumpBuffer-cleanup-simplification.patch (+37/-46)
v20250930-0015-BufFileLoadBuffer-comment.patch (+3/-1)
v20250930-0016-BufFileLoadBuffer-missing-FileRead-error-h.patch (+1/-1)
v20250930-0017-simplify-the-compression-header.patch (+121/-103)
v20250930-0018-undo-unncessary-changes-to-Makefile.patch (+0/-2)
v20250930-0019-enable-compression-for-tuplestore.patch (+7/-2)
v20250930-0020-remember-compression-method-for-each-file.patch (+10/-8)
v20250930-0021-LZ4_compress_default-returns-0-on-error.patch (+1/-2)
v20250930-0022-try-LZ4_compress_fast.patch (+4/-4)
#20Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#19)
Re: Proposal: Adding compression of temporary files

Hi,

On 9/30/25 14:42, Tomas Vondra wrote:

v20250930-0018-undo-unncessary-changes-to-Makefile.patch

- Why did the 0001 patch add this? Maybe it's something we should add
separately, not as part of this patch?

I realized this bit is actually necessary, to make the EXTRA_TESTS work
for the lz4 regression test. The attached patch series skips this bit.

There's also experimental patches adding gzip (or rather libz) and zstd
compression. This is very rough, I just wanted to see how would these
perform compared to pglz/lz4. But I haven't done any proper evaluation
so far, beyond running a couple simple queries. Will try to spend a bit
more time on that soon.

I still wonder about the impact of stream compression. I know it can
improve the compression ratio, but I'm not sure if it also helps with
the compression speed. I think for temporary files faster compression
(and lower ratio) may be a better trade-off. So maybe we should use
lower compression levels ...

Attached are two PDF files with results of the perf evaluation using
TPC-H 10GB and 50GB data sets. One table shows timings for 22 queries
with compression set to no/pglz/lz4, for a range of parameter
combinations (work_mem, parallel workers). The other shows the amount of
temporary file data (in MB) generated by each query.

The timing shows that pglz is pretty slow, about doubling duration for
some of the queries. That's not surprising, we know pglz can be slow.
lz4 is almost perfectly neutral, which is actually great - the goal is
to reduce I/O pressure for temporary files, but with a single query
running at a time, that's not a problem. So "no impact" is about the
best we can do, it shows the lz4 overhead is negligible.

For "size" PDF shows that the compression can save a fair amount of temp
space. For many queries it saves 50-70% of temporary space. A good
example is Q9 which (on the 50GB scale) used to take about 33GB, and
with compression it's down to ~17GB (with both pglz and lz4). That's
pretty good, I think.

FWIW the "size" results may be a bit misleading, in that it measures
tempfile size for the whole query. But some may use multiple temporary
files, and some may not support compression (e.g. tuplesort don't).
Which will make the actual compression ratio look lower. OTOH it's a
more representative of impact on actual queries.

regards

--
Tomas Vondra

Attachments:

v20251001-0001-Add-transparent-compression-for-temporary-.patch (+338/-32)
v20251001-0002-whitespace.patch (+10/-11)
v20251001-0003-pgindent.patch (+96/-73)
v20251001-0004-Add-regression-tests-for-temporary-file-co.patch (+3591/-10)
v20251001-0005-remove-unused-BufFile-compress_tempfile.patch (+0/-3)
v20251001-0006-simplify-BufFileCreateTemp-interface.patch (+12/-14)
v20251001-0007-improve-BufFileCreateTemp-BufFileCreateCom.patch (+20/-11)
v20251001-0008-BufFileCreateCompressTemp-cleanup-and-comm.patch (+21/-12)
v20251001-0009-minor-BufFileLoadBuffer-cleanup.patch (+10/-13)
v20251001-0010-BufFileLoadBuffer-simpler-FileRead-handlin.patch (+8/-6)
v20251001-0011-BufFileLoadBuffer-simpler-FileRead-handlin.patch (+6/-8)
v20251001-0012-BufFileLoadBuffer-comment-update.patch (+1/-2)
v20251001-0013-BufFileLoadBuffer-simplify-skipping-header.patch (+16/-18)
v20251001-0014-BufFileDumpBuffer-cleanup-simplification.patch (+37/-46)
v20251001-0022-experimental-zlib-gzip-compression.patch (+55/-2)
v20251001-0023-experimental-zstd-compression.patch (+55/-2)
v20251001-0015-BufFileLoadBuffer-comment.patch (+3/-1)
v20251001-0016-BufFileLoadBuffer-missing-FileRead-error-h.patch (+1/-1)
v20251001-0017-simplify-the-compression-header.patch (+121/-103)
v20251001-0018-enable-compression-for-tuplestore.patch (+7/-2)
v20251001-0019-remember-compression-method-for-each-file.patch (+10/-8)
v20251001-0020-LZ4_compress_default-returns-0-on-error.patch (+1/-2)
v20251001-0021-try-LZ4_compress_fast.patch (+4/-4)
v20251001-0024-add-regression-test-for-gzip-zlib.patch (+1796/-1)
v20251001-0025-add-regression-test-for-zstd.patch (+1797/-1)
compress-tpch-size.pdf
compress-tpch-timing.pdf
#21Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tomas Vondra (#20)
#22Filip Janus
fjanus@redhat.com
In reply to: Alvaro Herrera (#21)
#23lakshmi
lakshmigcdac@gmail.com
In reply to: Filip Janus (#18)
#24Filip Janus
fjanus@redhat.com
In reply to: lakshmi (#23)
#25Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#24)
#26Zsolt Parragi
zsolt.parragi@percona.com
In reply to: Filip Janus (#25)
#27lakshmi
lakshmigcdac@gmail.com
In reply to: Zsolt Parragi (#26)
#28lakshmi
lakshmigcdac@gmail.com
In reply to: lakshmi (#27)
#29Filip Janus
fjanus@redhat.com
In reply to: lakshmi (#28)
#30Filip Janus
fjanus@redhat.com
In reply to: Filip Janus (#29)
#31lakshmi
lakshmigcdac@gmail.com
In reply to: Filip Janus (#30)
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Filip Janus (#30)