Patch: dumping table data in multiple chunks in pg_dump
Attached is a patch that adds the ability to dump table data in multiple chunks.
Looking for feedback at this point:
1) What have I missed?
2) Should I implement something to avoid single-page chunks?
The new flag --huge-table-chunk-pages tells the directory-format
dump to dump any table whose main fork has more pages than this in
multiple chunks of the given number of pages.
The main use case is speeding up parallel dumps in case of one or a
small number of HUGE tables so parts of these can be dumped in
parallel.
It will also help in cases where the target file system has limitations
on file sizes (4 GB for FAT, 5 TB for GCS).
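For illustration, the chunk arithmetic could look roughly like this (a sketch of the assumed behavior, not the patch's actual C code; `chunk_ranges` is a hypothetical helper):

```python
def chunk_ranges(relpages, chunk_pages):
    """Split a table of `relpages` pages into (first_page, last_page)
    chunks of at most `chunk_pages` pages each."""
    ranges = []
    start = 0
    while start < relpages:
        end = min(start + chunk_pages, relpages) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 300000-page table with 131072-page (1 GB) chunks yields three chunks,
# the last one a partial chunk of 37856 pages:
print(chunk_ranges(300000, 131072))
```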
Currently no tests are included in the patch, and no extra
documentation beyond what is printed by pg_dump --help. Also, the
pg_log_warning lines with "CHUNKING" are there for debugging and need
to be removed before committing.
As implemented, no changes are needed for pg_restore, as all chunks are
already associated with the table in the .toc and thus are restored into
that table.
The attached README shows how I verified that it works, plus the textual
file created from the directory-format dump in the last step there.
--
Hannu
Hi Hannu,
On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk@google.com> wrote:
Attached is a patch that adds the ability to dump table data in multiple chunks.
Looking for feedback at this point:
1) what have I missed
2) should I implement something to avoid single-page chunks
The flag --huge-table-chunk-pages which tells the directory format
dump to dump tables where the main fork has more pages than this in
multiple chunks of given number of pages,
The main use case is speeding up parallel dumps in case of one or a
small number of HUGE tables so parts of these can be dumped in
parallel.
Have you measured speed up? Can you please share the numbers?
--
Best Wishes,
Ashutosh Bapat
I just ran a test by generating a 408GB table and then dumping it both ways
$ time pg_dump --format=directory -h 10.58.80.2 -U postgres -f
/tmp/plain.dump largedb
real 39m54.968s
user 37m21.557s
sys 2m32.422s
$ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=131072 -j 8 -f /tmp/parallel8.dump largedb
real 5m52.965s
user 40m27.284s
sys 3m53.339s
So parallel dump with 8 workers using 1GB (128k pages) chunks runs
almost 7 times faster than the sequential dump.
This was a table that had no TOAST part. I will run some more tests
with TOASTed tables next and expect similar or better improvements.
Going up to 16 workers did not improve performance, but this is
expected, as the disk behind the database can only do 4 TB/hour of
reads, which is now the bottleneck (408 GB / 352 s * 3600 ≈ 4172 GB/h).
$ time ./pg_dump --format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=131072 -j 16 -f /tmp/parallel16.dump largedb
real 5m44.900s
user 53m50.491s
sys 5m47.602s
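As a sanity check on the bandwidth estimate above (408 GB read in roughly 352 s of wall time):

```python
# Back-of-the-envelope read throughput for the 8-worker run
gb = 408
seconds = 352          # ~5m52s of wall time
gb_per_hour = gb / seconds * 3600
print(int(gb_per_hour))  # ~4172 GB/h, close to the disk's 4 TB/h limit
```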
And 4 workers showed near-linear speedup over a single worker:
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=131072 -j 4 -f /tmp/parallel4.dump largedb
real 10m32.074s
user 38m54.436s
sys 2m58.216s
The database runs on a 64-vCPU VM with 128 GB RAM, so most of the table
will be read in from disk.
Ran another test with a 53GB database where most of the data is in TOAST
CREATE TABLE just_toasted(
id serial primary key,
toasted1 char(2200) STORAGE EXTERNAL,
toasted2 char(2200) STORAGE EXTERNAL,
toasted3 char(2200) STORAGE EXTERNAL,
toasted4 char(2200) STORAGE EXTERNAL
);
and the TOAST fields were added in somewhat randomised order.
Here are the results:
Parallelism | chunk size (pages) | time (sec)
          1 |                  - |        240
          2 |               1000 |        129
          4 |               1000 |         64
          8 |               1000 |         36
         16 |               1000 |         30
          4 |               9095 |         78
          8 |               9095 |         42
         16 |               9095 |         42
The reason larger chunk sizes performed worse was that they often had
one or two stragglers left running alone at the end while the other
workers were already done.
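The straggler effect can be illustrated with a toy scheduling model (assumptions: workers pull chunks from a shared queue, and per-chunk durations vary around the mean, e.g. due to TOAST skew; this is not pg_dump's actual scheduler):

```python
import heapq
import random

def makespan(durations, workers):
    """Greedy pull-queue scheduling: each worker takes the next chunk
    when it becomes free; return the finish time of the last worker."""
    heap = [0.0] * workers
    for d in durations:
        t = heapq.heappop(heap)
        heapq.heappush(heap, t + d)
    return max(heap)

random.seed(1)
total_work = 240.0          # seconds of single-threaded work (cf. the 240 s run)
for n_chunks in (16, 145):  # ~9095-page vs ~1000-page chunks of a ~145k-page fork
    # per-chunk durations vary +/-50% around the mean
    durations = [total_work / n_chunks * random.uniform(0.5, 1.5)
                 for _ in range(n_chunks)]
    print(n_chunks, round(makespan(durations, 16), 1))
```

With few, large chunks the final "wave" leaves most workers idle while one or two stragglers finish; many small chunks smooth out the tail.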
Detailed run results below:
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres -f
/tmp/ltoastdb-1-plain.dump largetoastdb
real 3m59.465s
user 3m43.304s
sys 0m15.844s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=9095 -j 4 -f /tmp/ltoastdb-4.dump
largetoastdb
real 1m18.320s
user 3m49.236s
sys 0m19.422s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=9095 -j 8 -f /tmp/ltoastdb-8.dump
largetoastdb
real 0m42.028s
user 3m55.299s
sys 0m24.657s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=9095 -j 16 -f /tmp/ltoastdb-16.dump
largetoastdb
real 0m42.575s
user 4m11.011s
sys 0m26.110s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 16 -f /tmp/ltoastdb-16-1kpages.dump
largetoastdb
real 0m29.641s
user 6m16.321s
sys 0m49.345s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 8 -f /tmp/ltoastdb-8-1kpages.dump
largetoastdb
real 0m35.685s
user 3m58.528s
sys 0m26.729s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 4 -f /tmp/ltoastdb-4-1kpages.dump
largetoastdb
real 1m3.737s
user 3m50.251s
sys 0m18.507s
hannuk@pgn2:~/work/postgres/src/bin/pg_dump$ time ./pg_dump
--format=directory -h 10.58.80.2 -U postgres
--huge-table-chunk-pages=1000 -j 2 -f /tmp/ltoastdb-2-1kpages.dump
largetoastdb
real 2m8.708s
user 3m57.018s
sys 0m18.499s
The reason for the small chunk sizes is that they are determined by the
main heap table, and that was just over 1 GB:
largetoastdb=> SELECT format('%I.%I', t.schemaname, t.relname) as table_name,
pg_table_size(t.relid) AS table_size,
sum(pg_relation_size(i.indexrelid)) AS total_index_size,
pg_relation_size(t.relid) AS main_table_size,
pg_relation_size(c.reltoastrelid) AS toast_table_size,
pg_relation_size(oi.indexrelid) AS toast_index_size,
t.n_live_tup AS row_count,
count(*) AS index_count,
array_to_json(array_agg(json_build_object(i.indexrelid::regclass,
pg_relation_size(i.indexrelid))), true) AS index_info
FROM pg_stat_user_tables t
JOIN pg_stat_user_indexes i ON i.relid = t.relid
JOIN pg_class c ON c.oid = t.relid
LEFT JOIN pg_stat_sys_indexes AS oi ON oi.relid = c.reltoastrelid
GROUP BY 1, 2, 4, 5, 6, 7
ORDER BY 2 DESC, 7 DESC
LIMIT 25;
┌─[ RECORD 1 ]─────┬─────────────────────────────────────┐
│ table_name │ public.just_toasted │
│ table_size │ 56718835712 │
│ total_index_size │ 230064128 │
│ main_table_size │ 1191559168 │
│ toast_table_size │ 54613336064 │
│ toast_index_size │ 898465792 │
│ row_count │ 5625234 │
│ index_count │ 1 │
│ index_info │ [{"just_toasted_pkey" : 230064128}] │
└──────────────────┴─────────────────────────────────────┘
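The chunk counts follow directly from the main-fork size above (assuming the default 8 kB block size):

```python
import math

BLCKSZ = 8192
main_table_size = 1191559168          # bytes, from the query above
pages = main_table_size // BLCKSZ
print(pages)                          # 145454 pages, just over 1 GB

# --huge-table-chunk-pages=9095 splits the main fork into
print(math.ceil(pages / 9095))        # 16 chunks
# and --huge-table-chunk-pages=1000 into
print(math.ceil(pages / 1000))        # 146 chunks
```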
Added to https://commitfest.postgresql.org/patch/6219/
On Tue, Nov 11, 2025 at 9:00 PM Hannu Krosing <hannuk@google.com> wrote:
Attached is a patch that adds the ability to dump table data in multiple chunks.
[...]
+1 for the idea. I haven't done a detailed review, but while going
through the patch I noticed that we use pg_class->relpages to
decide whether to chunk the table. That should be fine, but
don't you think that using a direct size-calculation function like
pg_relation_size() would give a better idea and not depend on
whether the stats are up to date? That would make the chunking
behavior more deterministic.
--
Regards,
Dilip Kumar
Google
The expectation was that, as chunking is useful mainly in the case of
really huge tables, analyze should have been run "recently enough".
Maybe we should use pg_relation_size() in case we have already
determined that the table is large enough to warrant chunking? Maybe
at least 1/2 of the requested chunk size?
My reasoning was to not put too much extra load on pg_dump in case
chunking is not required. But of course we can use the presence of a
chunking request to decide to run pg_relation_size(), assuming the
overhead won't be too large in that case.
On Tue, Nov 25, 2025 at 2:32 AM Hannu Krosing <hannuk@google.com> wrote:
Maybe we should use pg_relation_size() in case we have already
determined that the table is large enough to warrant chunking? Maybe
at least 1/2 of the requested chunk size?
Yeah, that makes sense: we can use relpages for the initial
identification and then use pg_relation_size() if relpages says the
table is large enough.
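The two-step check being discussed could be sketched like this (hypothetical logic, not the patch's code; `get_exact_pages` stands in for a `pg_relation_size()` query):

```python
def should_chunk(relpages, chunk_pages, get_exact_pages):
    """Use the cheap (but possibly stale) pg_class.relpages estimate as
    a pre-filter, and only pay for the exact size on candidates."""
    # relpages can be stale; use half the threshold as a safety margin
    if relpages < chunk_pages // 2:
        return False
    return get_exact_pages() > chunk_pages

# small table: the exact size is never queried
print(should_chunk(10, 131072, lambda: 1 / 0))        # False
# borderline table: the exact size decides
print(should_chunk(131072, 131072, lambda: 200000))   # True
```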
--
Regards,
Dilip Kumar
Google
Hi
On Fri, Dec 12, 2025 at 9:02 AM Hannu Krosing <hannuk@google.com> wrote:
Attached is a patch that adds the ability to dump table data in multiple chunks.
[...]
I took a first look at this patch, and there are some whitespace issues.
Regards
Pavel
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk@google.com> wrote:
I think this could be useful, but I think you'll need to find a way to
not do this for non-heap tables. Per the comments in TableAmRoutine,
both scan_set_tidrange and scan_getnextslot_tidrange are optional
callback functions and the planner won't produce TIDRangePaths if
either of those don't exist. Maybe that means you need to consult
pg_class.relam to ensure the amname is 'heap' or at least the relam =
2. On testing Citus's columnar AM, I get:
postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
ERROR: UPDATE and CTID scans not supported for ColumnarScan
1. For the patch, I think you should tighten the new option up to mean
the maximum segment size that a table will be dumped in. I see you
have comments like:
/* TODO: add hysteresis here, maybe < 1.1 * huge_table_chunk_pages */
You *have* to put the cutoff *somewhere*, so I think it very much
should be exactly the specified threshold. If anyone is unhappy that
some segments consist of a single page, then that's on them to adjust
the parameter accordingly. Otherwise, someone complains that they got
a 1-page segment when the table was 10.0001% bigger than the cutoff
and then we're tempted to add a new setting to control the 1.1 factor,
which is just silly. If there's a 1-page segment, so what? It's not a
big deal.
Perhaps --max-table-segment-pages is a better name than
--huge-table-chunk-pages, as it's quite subjective what the minimum
number of pages required to make a table "huge" is.
2. I'm not sure if you're going to get away with using relpages for
this. Is it really that bad to query pg_relation_size() when this
option is set? If it really is a problem, then maybe let the user
choose with another option. I understand we're using relpages for
sorting table sizes so we prefer dumping larger tables first, but that
just seems way less important if it's not perfectly accurate.
3. You should be able to simplify the code in dumpTableData() so
you're not adding any extra cases. You could use InvalidBlockNumber to
indicate an unbounded ctid range and only add ctid qual to the WHERE
clause when you have a bounded range (i.e not InvalidBlockNumber).
That way the first segment will need WHERE ctid <= '...' and the final
one will need WHERE ctid >= '...'. Everything in between will have an
upper and lower bound. That results in no ctid quals being added when
both ranges are set to InvalidBlockNumber, which you should use for
all tables not large enough to be segmented, thus no special case.
TID Range scans are perfectly capable of working when only bounded at one side.
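Point 3's single-code-path idea could be sketched as follows (a Python sketch of the assumed logic; the patch builds this in C, and the exact bound forms here are assumptions, with None standing in for InvalidBlockNumber):

```python
def ctid_qual(lower_page, upper_page):
    """Return the WHERE fragment for a chunk spanning the given pages
    (inclusive lower, exclusive upper), or '' for an unchunked table."""
    quals = []
    if lower_page is not None:
        quals.append("ctid >= '(%d,1)'" % lower_page)
    if upper_page is not None:
        quals.append("ctid < '(%d,1)'" % upper_page)
    return " AND ".join(quals)

print(repr(ctid_qual(None, None)))     # unchunked table: no qual at all
print(ctid_qual(None, 131072))         # first chunk: upper bound only
print(ctid_qual(131072, 262144))       # middle chunk: both bounds
print(ctid_qual(262144, None))         # last chunk: lower bound only
```

One code path covers all four cases, so tables too small to segment need no special handling.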
4. I think using "int" here is a future complaint waiting to happen.
+ if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
+ &dopt.huge_table_chunk_pages))
I bet we'll eventually see a complaint that someone can't make the
segment size larger than 16TB. I think option_parse_uint32() might be
called for.
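The 16 TB figure follows from the default 8 kB block size: an int caps the option at INT32_MAX pages.

```python
# Maximum segment size expressible with a signed 32-bit page count
INT32_MAX = 2**31 - 1
BLCKSZ = 8192
limit_bytes = (INT32_MAX + 1) * BLCKSZ   # 2^31 pages * 2^13 bytes = 2^44
print(limit_bytes == 2**44)              # True
print(limit_bytes // 2**40)              # 16 (TiB)
```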
David
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk@google.com> wrote:
I see you added the "Backport" tag in the CF. This isn't the sort of
thing that we'd do that for. Was that a mistake?
David
On Tue, Jan 13, 2026 at 3:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk@google.com> wrote:
I think this could be useful, but I think you'll need to find a way to
not do this for non-heap tables. Per the comments in TableAmRoutine,
both scan_set_tidrange and scan_getnextslot_tidrange are optional
callback functions and the planner won't produce TIDRangePaths if
either of those don't exist. Maybe that means you need to consult
pg_class.relam to ensure the amname is 'heap' or at least the relam =
2.
Makes sense, will add.
On testing Citus's columnar AM, I get:
postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
ERROR: UPDATE and CTID scans not supported for ColumnarScan
Should we just silently not chunk tables that have some storage
architecture that does not have TIDs, or should pg_dump just error out
in this case?
I imagine the Citus columnar is often used with huge tables where
chunking would be most useful.
Later it likely makes sense to have another option for chunking other
types of tables, or maybe even add something to the TableAM for
chunking support.
1. For the patch, I think you should tighten the new option up to mean
the maximum segment size that a table will be dumped in. I see you
have comments like:
/* TODO: add hysteresis here, maybe < 1.1 * huge_table_chunk_pages */
You *have* to put the cutoff *somewhere*, so I think it very much
should be exactly the specified threshold. If anyone is unhappy that
some segments consist of a single page, then that's on them to adjust
the parameter accordingly. Otherwise, someone complains that they got
a 1-page segment when the table was 10.0001% bigger than the cutoff
and then we're tempted to add a new setting to control the 1.1 factor,
which is just silly. If there's a 1-page segment, so what? It's not a
big deal.
Agreed, will drop the TODO
Perhaps --max-table-segment-pages is a better name than
--huge-table-chunk-pages as it's quite subjective what the minimum
number of pages required to make a table "huge".
I agree. My initial thinking was that it is mainly useful for huge
tables, but indeed that does not need to be reflected in the flag name
2. I'm not sure if you're going to get away with using relpages for
this. Is it really that bad to query pg_relation_size() when this
option is set? If it really is a problem, then maybe let the user
choose with another option. I understand we're using relpages for
sorting table sizes so we prefer dumping larger tables first, but that
just seems way less important if it's not perfectly accurate.
Yeah, I had thought of pg_relation_size() myself.
Another option would be something more complex which tries to estimate
the dump file sizes by figuring out TOAST for each chunk. The thing
that makes this really complex is the possible uneven distribution of
TOAST and needing to take into account both the compression of TOAST
AND the compression of the resulting dump file.
3. You should be able to simplify the code in dumpTableData() so
you're not adding any extra cases. You could use InvalidBlockNumber to
indicate an unbounded ctid range and only add ctid qual to the WHERE
clause when you have a bounded range (i.e. not InvalidBlockNumber).
That way the first segment will need WHERE ctid <= '...' and the final
one will need WHERE ctid >= '...'. Everything in between will have an
upper and lower bound. That results in no ctid quals being added when
both ranges are set to InvalidBlockNumber, which you should use for
all tables not large enough to be segmented, thus no special case.
Makes sense, will look into it.
TID Range scans are perfectly capable of working when only bounded at one side.
4. I think using "int" here is a future complaint waiting to happen.
+ if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
+ &dopt.huge_table_chunk_pages))
I bet we'll eventually see a complaint that someone can't make the
segment size larger than 16TB. I think option_parse_uint32() might be
called for.
There can be no more than 2 * INT32_MAX pages anyway.
I thought half of the max possible size should be enough.
Do you really think that somebody would want that?
David
On Tue, Jan 13, 2026 at 3:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk@google.com> wrote:
I see you added the "Backport" tag in the CF. This isn't the sort of
thing that we'd do that for. Was that a mistake?
Just wanted to mark it as something that might be backported later, as
one of the important cases for pg_dump is dumping older databases.
Will remove if the tag means something else.
On Wed, 14 Jan 2026 at 23:52, Hannu Krosing <hannuk@google.com> wrote:
On Tue, Jan 13, 2026 at 3:27 AM David Rowley <dgrowleyml@gmail.com> wrote:
On testing Citus's columnar AM, I get:
postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
ERROR: UPDATE and CTID scans not supported for ColumnarScan
Should we just silently not chunk tables that have some storage
architecture that does not have TIDs, or should pg_dump just error out
in this case?
I think you should just document that it only applies to heap tables.
I don't think erroring out is useful to anyone, especially if the
error only arrives after pg_dump has been running for several hours or
even days.
4. I think using "int" here is a future complaint waiting to happen.
+ if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
+ &dopt.huge_table_chunk_pages))
I bet we'll eventually see a complaint that someone can't make the
segment size larger than 16TB. I think option_parse_uint32() might be
called for.
There can be no more than 2 * INT32_MAX pages anyway.
I thought half of the max possible size should be enough.
Do you really think that somebody would want that?
IMO, if the option can't represent the full range of BlockNumber, then
that's a bug.
I see pg_resetwal has recently invented strtouint32_strict for this.
It might be a good idea to refactor that and put it into
option_utils.c rather than having each client app have to invent their
own method.
David
On 2026-01-14 11:52:54 +0100, Hannu Krosing wrote:
On Tue, Jan 13, 2026 at 3:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk@google.com> wrote:
I see you added the "Backport" tag in the CF. This isn't the sort of
thing that we'd do that for. Was that a mistake?
Just wanted to mark it as something that might be backported later, as
one of the important cases for pg_dump is dumping older databases.
I think it's obvious that nothing that's being discussed here has any business
being considered for backporting.
Nor can I follow the argument that dumping of old databases would be an
argument, as the proposed change is in pg_dump, not the server, so a new
pg_dump will suffice to get the benefit, no?
On Wed, Jan 14, 2026 at 10:40 PM Andres Freund <andres@anarazel.de> wrote:
On 2026-01-14 11:52:54 +0100, Hannu Krosing wrote:
On Tue, Jan 13, 2026 at 3:32 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <hannuk@google.com> wrote:
I see you added the "Backport" tag in the CF. This isn't the sort of
thing that we'd do that for. Was that a mistake?
Just wanted to mark it as something that might be backported later, as
one of the important cases for pg_dump is dumping older databases.
I think it's obvious that nothing that's being discussed here has any business
being considered for backporting.
Do we have clear written guidelines about what can and can not be backported ?
And do we distinguish between the core database, extensions and tools for this?
Nor can I follow the argument that dumping of old databases would be an
argument, as the proposed change is in pg_dump, not the server, so a new
pg_dump will suffice to get the benefit, no?
I was not sure if pg_restore is backwards compatible enough to be able
to restore from newer pg_dump versions.
----
Hannu
On Wed, Jan 14, 2026 at 10:46:07PM +0100, Hannu Krosing wrote:
Do we have clear written guidelines about what can and can not be backported ?
And do we distinguish between the core database, extensions and tools for this?
From https://www.postgresql.org/support/versioning:
Minor releases only contain fixes for frequently-encountered bugs,
low-risk fixes, security issues, and data corruption problems.
--
nathan
Thanks Nathan,
Are these the whole guidelines we have for this?
I did not recognise it at once, but I assume this also implies things
about backporting, as the only way to backport something is to put it
in a minor release.
I was hoping that there were some exceptions possible for things
affecting interactions between different versions that need also
support from older versions, but if this page is all we have then
likely not.
BTW, this is also why I am not entirely happy about logical
replication being part of the core, as this implies that no bigger
improvements are possible for cases when it is used for version
upgrades. Or at least the improvements would be of no use for
upgrading existing databases.
Do we have the same strict no improvements in minor versions policy
for contrib/ extensions and tools?
On Wed, Jan 14, 2026 at 10:53 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Wed, Jan 14, 2026 at 10:46:07PM +0100, Hannu Krosing wrote:
Do we have clear written guidelines about what can and can not be backported ?
And do we distinguish between the core database, extensions and tools for this?
From https://www.postgresql.org/support/versioning:
Minor releases only contain fixes for frequently-encountered bugs,
low-risk fixes, security issues, and data corruption problems.
--
nathan