parallel data loading for pgbench -i

Started by Mircea Cadariu · 3 months ago · 15 messages · hackers
#1Mircea Cadariu
cadariu.mircea@gmail.com

Hi,

I propose a patch for speeding up pgbench -i through multithreading.

To enable this, pass -j and then the number of workers you want to use.

Here are some results I got on my laptop:

master

---

-i -s 100
done in 20.95 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 14.51 s, vacuum 0.27 s, primary keys 6.16 s).

-i -s 100 --partitions=10
done in 29.73 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.33 s, vacuum 8.72 s, primary keys 4.67 s).

patch (-j 10)

---

-i -s 100 -j 10
done in 18.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 5.82 s, vacuum 6.89 s, primary keys 5.93 s).

-i -s 100 -j 10 --partitions=10
done in 14.66 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 8.42 s, vacuum 1.55 s, primary keys 4.68 s).

The speedup is more significant for the partitioned use-case. This is
because each worker creates its own separate partitions and can therefore
use COPY FREEZE, incurring a lower vacuum penalty.

For the non-partitioned case the speedup is lower, but I observe it
improves somewhat with larger scale factors. When parallel vacuum
support is merged, this should further reduce the time.
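For illustration, the per-worker split in the non-partitioned case could look like the following sketch. This is a hypothetical helper, not code from the patch; it only assumes pgbench's 100,000 accounts per scale unit.

```python
# Hypothetical sketch (not the patch's actual code): dividing the
# pgbench_accounts rows for scale factor `scale` evenly among `nworkers`,
# the way a parallel "-i -j" might hand out COPY row ranges.
def account_ranges(scale, nworkers):
    total = scale * 100_000          # pgbench generates 100k accounts per scale unit
    base, rem = divmod(total, nworkers)
    ranges, start = [], 0
    for w in range(nworkers):
        n = base + (1 if w < rem else 0)   # spread any remainder over the first workers
        ranges.append((start, start + n))  # half-open [start, end) row range
        start += n
    return ranges

# Example: scale 100 split across 10 workers -> ten 1M-row ranges
print(account_ranges(100, 10)[0])   # (0, 1000000)
```

Each worker would then COPY only its own range over its own connection.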

I still need to update the docs and tests, better integrate the code with
its surroundings, and polish other aspects. I'd appreciate any feedback on
what I have so far though. Thanks!

Kind regards,

Mircea Cadariu

Attachments:

v1-0001-parallel-pgbench-wip.patch (text/plain, +420 -36)
#2lakshmi
lakshmigcdac@gmail.com
In reply to: Mircea Cadariu (#1)
Re: parallel data loading for pgbench -i

Hi Mircea,

I tested the patch on 19devel and it worked well for me.
Before applying it, -j is rejected in pgbench initialization mode as
expected. After applying the patch, pgbench -i -s 100 -j 10 runs
successfully and shows a clear speedup.
On my system the total runtime dropped to about 9.6s, with client-side data
generation around 3.3s.
I also checked correctness after the run — row counts for pgbench_accounts,
pgbench_branches, and pgbench_tellers all match the expected values.

Thanks for working on this, the improvement is very noticeable.

Best regards,
lakshmi

On Mon, Jan 19, 2026 at 2:30 PM Mircea Cadariu <cadariu.mircea@gmail.com>
wrote:

Show quoted text


#3Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: lakshmi (#2)
Re: parallel data loading for pgbench -i

Hi Lakshmi,

On 19/01/2026 09:25, lakshmi wrote:


Thanks for having a look and trying it out!

FYI this is one of Tomas Vondra's patch ideas from his blog [1].

I have attached a new version which now includes docs, tests, a proposed
commit message, and an attempt to fix the current CI failures (Windows).

[1]: https://vondra.me/posts/patch-idea-parallel-pgbench-i

--
Thanks,
Mircea Cadariu

Attachments:

v1-0001-Add-parallel-data-loading-support-to-pgbench.patch (text/plain, +501 -39)
#4lakshmi
lakshmigcdac@gmail.com
In reply to: Mircea Cadariu (#3)
Re: parallel data loading for pgbench -i

Hi Mircea,

Thanks again for the updated patch.
I did some additional testing on 19devel with a larger scale factor.
For scale 100, parallel initialization with -j 10 shows a clear overall
speedup and correct results, as mentioned earlier.
For scale 500, I observed that client-side data generation becomes
significantly faster with parallel loading, but the total run time was
slightly higher than the serial case on my system. This appears to be
mainly due to a much longer vacuum phase after the parallel load.
So the parallel approach clearly improves data generation time, but the
overall benefit may depend on scale and workload characteristics.
Regression tests still pass locally, and correctness checks look good.

Just sharing these observations in case they are useful for further
evaluation.

Best regards,
lakshmi

On Thu, Jan 29, 2026 at 4:49 PM Mircea Cadariu <cadariu.mircea@gmail.com>
wrote:

Show quoted text


#5Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Mircea Cadariu (#3)
RE: parallel data loading for pgbench -i

Dear Mircea,

Thanks for the proposal. I also feel the initialization wastes time.
Here are my initial comments.

01.
I found that pgbench raises a FATAL error when -j > --partitions; is there a
specific reason? If needed, we could take the softer approach of adjusting
nthreads down to the number of partitions. -c and -j already do something
similar:

```
if (nthreads > nclients && !is_init_mode)
nthreads = nclients;
```

02.
Also, why is -j accepted in case of non-partitions?

03.
Can we port all validation to main()? I found initPopulateTableParallel() has
such a part.

04.
Copying seems to be divided into chunks of COPY_BATCH_SIZE. Is that really
essential for parallelizing the initialization? I feel it may optimize even
the serialized case, and thus could be discussed independently.

05.
Per my understanding, each thread creates its own tables, and all of them
are then attached to the parent table. Is that right? I think this needs
more code changes, and I am not sure it is critical for making
initialization faster.

So I suggest using the incremental approach. The first patch only parallelizes
the data load, and the second patch implements the CREATE TABLE and ALTER TABLE
ATTACH PARTITION. You can benchmark three patterns, master, 0001, and
0001 + 0002, then compare the results. IIUC, this is the common approach to
reduce the patch size and make them more reviewable.

06.
Missing update for typedefs.list. WorkerTask and CopyTarget can be added there.

07.
Since there is a report like [1], you can benchmark more cases.

[1]: /messages/by-id/CAEvyyTht69zjnosPjziW6dqNLqs-n6eKia2vof108zQp1QFX=Q@mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#6lakshmi
lakshmigcdac@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#5)
Re: parallel data loading for pgbench -i

Hi Mircea, Hayato,
I ran a few more tests on 19devel, focusing on the partitioned case to
better understand the performance behavior.

For scale 500, the serial initialization on my system takes around 34.3
seconds. Using parallel initialization without partitions (-j 10) makes the
client-side data generation noticeably faster, but the overall runtime ends
up slightly higher because the vacuum phase becomes much longer.

However, when running with partitions (pgbench -i -s 500 --partitions=10
-j 10), the total runtime drops to about 21.9 seconds, and the vacuum cost
is much smaller. I also verified that the row counts are correct in all
cases, and regression tests still pass locally.

So it looks like the main benefit of parallel initialization shows up
clearly in the partitioned setup, which matches the expectations discussed
earlier. Just sharing these observations in case they are useful for the
ongoing review.
Thanks again for the work on this patch.

Best regards,
Lakshmi

On Wed, Feb 11, 2026 at 5:53 PM Hayato Kuroda (Fujitsu) <
kuroda.hayato@fujitsu.com> wrote:

Show quoted text


#7Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: lakshmi (#6)
RE: parallel data loading for pgbench -i

Dear Lakshmi,

Thanks for the measurement!

For scale 500, the serial initialization on my system takes around 34.3 seconds.
Using parallel initialization without partitions (-j 10) makes the client-side
data generation noticeably faster, but the overall runtime ends up slightly
higher because the vacuum phase becomes much longer.

To confirm, do you know why the VACUUM needs more time than in the serial case?

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#8lakshmi
lakshmigcdac@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#7)
Re: parallel data loading for pgbench -i

On Fri, Feb 20, 2026 at 3:29 PM Hayato Kuroda (Fujitsu) <
kuroda.hayato@fujitsu.com> wrote:

Dear Lakshmi,

Thanks for the measurement!

For scale 500, the serial initialization on my system takes around 34.3
seconds. Using parallel initialization without partitions (-j 10) makes the
client-side data generation noticeably faster, but the overall runtime ends
up slightly higher because the vacuum phase becomes much longer.

To confirm, do you know why the VACUUM needs more time than in the serial
case?

Dear Hayato,

Thank you for the question.

From what I observed, in the non-partitioned parallel case the data
generation phase becomes much faster, but the VACUUM phase takes longer
compared to the serial run.

My current understanding is that this may be related to multiple workers
inserting into the same heap relation. That could affect page locality or
increase the amount of freezing work required afterward. In contrast, the
partitioned case seems to benefit more clearly, likely because each worker
operates on a separate partition and COPY FREEZE reduces the vacuum effort.

I have not yet done deeper internal analysis, so this is based on the
behavior I measured rather than detailed inspection. If needed, I can try
to collect additional statistics to better understand the difference.

Please let me know if this reasoning aligns with your understanding.

Best regards
Lakshmi

Show quoted text
#9Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: lakshmi (#8)
Re: parallel data loading for pgbench -i

Hi Lakshmi, Hayato,

Thanks a lot for your input!

I'm not sure why the VACUUM phase takes longer compared to the serial
run. We can potentially get a clue with a profiler. I know there is an
ongoing effort to introduce parallel heap vacuum [1], which I expect will
help with this.

The code comments you provided have been applied to the attached v2 patch.
Below are my answers to the questions.

Also, why is -j accepted in case of non-partitions?

For non-partitioned tables, each worker loads a separate range of rows
via its own connection in parallel.

Copying seems to be divided into chunks per COPY_BATCH_SIZE. Is it really
essential to parallelize the initialization? I feel it may optimize even
serialized case thus can be discussed independently.

You're right that the COPY batching is an optimization that's
independent. I wanted to see how fast I can get this patch, so I looked
for bottlenecks in the new code with a profiler and this was one of
them. I agree it makes sense to apply this for the serialised case
separately.

Per my understanding, each thread creates its tables, and all of them are
attached to the parent table. Is it right? I think it needs more code
changes, and I am not sure it is critical to make initialization faster.

Yes, that's correct. Each worker creates its assigned partitions as
standalone tables, loads data into them, and then the main thread
attaches them all to the parent after loading completes. It's to avoid
AccessExclusiveLock contention on the parent table during parallel
loading and allow each worker to use COPY FREEZE on its standalone table.
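As a sketch of that sequence: the SQL strings below illustrate what a worker and the main thread might issue under the scheme described above. The table names, partition bounds, and exact option spelling are assumptions for illustration, not taken from the patch.

```python
# Illustrative only: the per-worker SQL under the scheme described above
# (standalone table + COPY FREEZE, attached by the main thread afterwards).
def worker_sql(part):
    # COPY FREEZE is only permitted when the target table was created or
    # truncated in the same transaction, hence the CREATE inside the
    # worker's own transaction.
    return [
        "BEGIN;",
        f"CREATE TABLE pgbench_accounts_{part} (LIKE pgbench_accounts);",
        f"COPY pgbench_accounts_{part} FROM STDIN (FREEZE ON);",
        "COMMIT;",
    ]

def attach_sql(part, lo, hi):
    # Issued by the main thread once every worker has finished loading;
    # only this step needs a lock on the parent table.
    return (f"ALTER TABLE pgbench_accounts ATTACH PARTITION "
            f"pgbench_accounts_{part} FOR VALUES FROM ({lo}) TO ({hi});")
```

Deferring ATTACH PARTITION to the end is what avoids lock contention on the parent during the parallel load.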

So I suggest using the incremental approach. The first patch only
parallelizes
the data load, and the second patch implements the CREATE TABLE and
ALTER TABLE
ATTACH PARTITION. You can benchmark three patterns, master, 0001, and
0001 + 0002, then compare the results. IIUC, this is the common
approach to
reduce the patch size and make them more reviewable.

Thanks for the recommendation, I extracted 0001 and 0002 as per your
suggestion. I will see if I can split it more, as indeed it helps with
the review.

Results are similar with the previous runs.

master

pgbench -i -s 100 -j 10
done in 20.95 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 14.51 s, vacuum 0.27 s, primary keys 6.16 s).

pgbench -i -s 100 -j 10 --partitions=10
done in 29.73 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.33 s, vacuum 8.72 s, primary keys 4.67 s).

0001
pgbench -i -s 100 -j 10
done in 18.75 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 6.51 s, vacuum 5.73 s, primary keys 6.50 s).

pgbench -i -s 100 -j 10 --partitions=10
done in 29.33 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.48 s, vacuum 7.59 s, primary keys 5.24 s).

0002
pgbench -i -s 100 -j 10
done in 18.12 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 6.64 s, vacuum 5.81 s, primary keys 5.65 s).

pgbench -i -s 100 -j 10 --partitions=10
done in 14.38 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 7.97 s, vacuum 1.55 s, primary keys 4.85 s).

Looking forward to your feedback.

[1]: /messages/by-id/CAD21AoAEfCNv-GgaDheDJ+s-p_Lv1H24AiJeNoPGCmZNSwL1YA@mail.gmail.com

--
Thanks,
Mircea Cadariu

Attachments:

v2-0001-Add-parallel-data-loading-support-to-pgbench.patch (text/plain, +311 -18)
v2-0002-Extend-pgbench-parallel-data-loading-to-range-par.patch (text/plain, +189 -28)
#10lakshmi
lakshmigcdac@gmail.com
In reply to: Mircea Cadariu (#9)
Re: parallel data loading for pgbench -i

Hi Mircea, Hayato,

Thanks for the updated v2 patches.

I applied 0001 and 0002 on 19devel and ran some tests. The results look
consistent.

For scale 100, parallel loading speeds up data generation, but in the
non-partitioned case, the VACUUM phase becomes noticeably slower. In
contrast, the partitioned + parallel case performs best overall with much
lower vacuum cost.

For scale 500, I see the same pattern: non-partitioned parallel runs are
dominated by VACUUM time, while the partitioned setup shows a clear overall
speedup.

I also verified correctness, and row counts match expected values.

So overall, the benefit of parallel loading is much clearer in the
partitioned case.

I’ll try to look further into the VACUUM behavior.

Thanks again for the work on this.

Best regards,
Lakshmi

On Fri, Mar 13, 2026 at 11:59 PM Mircea Cadariu <cadariu.mircea@gmail.com>
wrote:

Show quoted text


#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: lakshmi (#10)
Re: parallel data loading for pgbench -i

On 18/03/2026 12:37, lakshmi wrote:

So overall, the benefit of parallel loading is much clearer in the
partitioned case.

I’ll try to look further into the VACUUM behavior.

As discussed already, the slower VACUUM is surely due to the lack of the
COPY FREEZE option. That's unfortunate...

The way this patch uses the connections and workers is a little bonkers.
The main thread uses the first connection to execute:

begin; TRUNCATE TABLE pgbench_accounts;

That connection is handed over to the first worker thread, and new
connections are opened for the other workers. But thanks to the
TRUNCATE, the open transaction on the first connection holds an
AccessExclusiveLock, preventing the other workers from starting the COPY
until the first worker has finished! I added some debugging prints to
show this:

$ pgbench -s500 -i -j10 postgres
dropping old tables...
creating tables...
generating data (client-side)...
loading pgbench_accounts with 10 threads...
0.00: thread 0: sending COPY command, use_freeze: 1
0.00: thread 1: sending COPY command, use_freeze: 0
0.00: thread 2: sending COPY command, use_freeze: 0
0.00: thread 0: COPY started for rows between 0 and 5000000
0.00: thread 6: sending COPY command, use_freeze: 0
0.00: thread 3: sending COPY command, use_freeze: 0
0.00: thread 9: sending COPY command, use_freeze: 0
0.00: thread 4: sending COPY command, use_freeze: 0
0.00: thread 5: sending COPY command, use_freeze: 0
0.00: thread 7: sending COPY command, use_freeze: 0
0.00: thread 8: sending COPY command, use_freeze: 0
6.19: thread 0: COPY done!
6.27: thread 9: COPY started for rows between 45000000 and 50000000
6.27: thread 1: COPY started for rows between 5000000 and 10000000
6.27: thread 5: COPY started for rows between 25000000 and 30000000
6.27: thread 2: COPY started for rows between 10000000 and 15000000
6.27: thread 6: COPY started for rows between 30000000 and 35000000
6.27: thread 3: COPY started for rows between 15000000 and 20000000
6.27: thread 8: COPY started for rows between 40000000 and 45000000
6.27: thread 4: COPY started for rows between 20000000 and 25000000
6.27: thread 7: COPY started for rows between 35000000 and 40000000
19.19: thread 1: COPY done!
19.21: thread 9: COPY done!
19.26: thread 6: COPY done!
19.27: thread 7: COPY done!
19.28: thread 3: COPY done!
19.28: thread 5: COPY done!
19.28: thread 4: COPY done!
19.29: thread 8: COPY done!
19.36: thread 2: COPY done!
vacuuming...
creating primary keys...
done in 71.58 s (drop tables 0.07 s, create tables 0.01 s, client-side
generate 19.41 s, vacuum 26.50 s, primary keys 25.59 s).

The straightforward fix is to commit the TRUNCATE transaction, and not
use FREEZE on any of the COPY commands.

This all makes more sense in the partitioned case. Perhaps we should
parallelize only when partitions are used, and use only one thread per
partition.

- Heikki

#12Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: Heikki Linnakangas (#11)
Re: parallel data loading for pgbench -i

Hi,

On 07/04/2026 10:00, Heikki Linnakangas wrote:

This all makes more sense in the partitioned case. Perhaps we should
parallelize only when partitions are used, and use only one thread
per partition.

Thanks for having a look. I attached v3 that parallelizes only the
partitioned case, one thread per partition. Results:

patch:

pgbench -i -s 100 --partitions 10

done in 12.63 s (drop tables 0.05 s, create tables 0.01 s, client-side
generate 5.98 s, vacuum 1.63 s, primary keys 4.96 s).

master:

pgbench -i -s 100 --partitions 10

done in 29.29 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.31 s, vacuum 7.78 s, primary keys 5.18 s).

--
Thanks,
Mircea Cadariu

Attachments:

v3-0001-pgbench-parallelize-account-loading-for-range-partit.patch (text/plain, +134 -5)
#13lakshmi
lakshmigcdac@gmail.com
In reply to: Mircea Cadariu (#12)
Re: parallel data loading for pgbench -i

Hi Mircea, Heikki,

I tested the v3 patch on 19devel with larger scale factors.

The behavior looks much better now compared to the earlier versions. For
scale 100 and 500, I see clear improvements in overall runtime, and for
scale 2000, the total time is around 97s on my system.

The loading phase now runs concurrently across workers, and I don’t see the
earlier serialization behavior anymore.

The VACUUM phase also remains relatively small (~6s for scale 2000), which
suggests that the previous overhead has been addressed.

I also verified correctness, and the row counts match the expected values.

Overall, the partitioned parallel approach looks solid and scales well in
my tests.

Thanks again for the work on this.

Best regards,
Lakshmi

On Sat, Apr 11, 2026 at 12:07 AM Mircea Cadariu <cadariu.mircea@gmail.com>
wrote:

Show quoted text


#14Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Mircea Cadariu (#12)
RE: parallel data loading for pgbench -i

Dear Mircea,

Thanks for updating the patch. Now each worker no longer creates its own
child tables; it just runs TRUNCATE and COPY. But I'm unclear why the
TRUNCATE is needed here. Aren't the tables already truncated in
initGenerateDataClientSide()->initTruncateTables() before launching the threads?
Also, the current API is questionable. For example, we cannot load in series
if --partitions is specified. And I'm afraid an OOM failure may be more
likely if the table has many partitions.
Is it possible to have -j again for the initialization? We could then
require partitions >= nthreads or partitions % nthreads == 0.
Is it possible that we can have -p again for the initialization? We can require
partitions >= nthreads or partitions % nthreads == 0 at that time.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

#15lakshmi
lakshmigcdac@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: parallel data loading for pgbench -i

Hi Hayato,

Thanks for your feedback.

I tried a few runs with different partition counts. From what I saw,
performance doesn't always improve with more partitions; in fact, higher
partition counts increase VACUUM time and slow things down.

I also agree that having control over the number of workers (like using -j)
would help balance this better.

Regarding TRUNCATE, I noticed it’s already done earlier, so it might be
worth checking if the extra TRUNCATE is needed.

I didn’t see memory issues in my tests, but I understand it could become a
concern with many partitions.

Thanks again for the suggestions.

Best regards,
Lakshmi

On Mon, Apr 13, 2026 at 12:53 PM Hayato Kuroda (Fujitsu) <
kuroda.hayato@fujitsu.com> wrote:

Show quoted text

FUJITSU LIMITED