Faster data load

Started by Lok Pover 1 year ago9 messagesgeneral
Jump to latest
#1Lok P
loknath.73@gmail.com

Hi,

We are having a requirement to create approx 50 billion rows in a partition
table(~1 billion rows per partition, 200+gb size daily partitions) for a
performance test. We are currently using ' insert into <target
table_partition> select.. From <source_table_partition> or <some
transformed query>;' method . We have dropped all indexes and constraints
First and then doing the load. Still it's taking 2-3 hours to populate one
partition. Is there a faster way to achieve this?

Few teammate suggesting to use copy command and use file load instead,
which will be faster. So I wanted to understand, how different things it
does behind the scenes as compared to insert as select command? As because
it only deals with sql engine only.

Additionally, when we were trying to create indexes post data load on one
partition, it took 30+ minutes. Any possible way to make it faster?

Is there any way to drive the above things in parallel by utilizing full
database resources?

It's postgres 15.4

Regards
Lok

#2Ron
ronljohnsonjr@gmail.com
In reply to: Lok P (#1)
Re: Faster data load

On Thu, Sep 5, 2024 at 4:14 PM Lok P <loknath.73@gmail.com> wrote:

Hi,

We are having a requirement to create approx 50 billion rows in a
partition table(~1 billion rows per partition, 200+gb size daily
partitions) for a performance test. We are currently using ' insert into
<target table_partition> select.. From <source_table_partition> or <some
transformed query>;' method . We have dropped all indexes and constraints
First and then doing the load. Still it's taking 2-3 hours to populate one
partition.

At three hours, that's 92,593 records/second. Seems pretty slow.

How much of that time is taken by <some transformed query>?
How big are the records?
How fast is the hardware?

Is there a faster way to achieve this?

Testing is the only way to know for sure.

Few teammate suggesting to use copy command and use file load instead,
which will be faster. So I wanted to understand, how different things it
does behind the scenes as compared to insert as select command? As because
it only deals with sql engine only.

COPY is highly optimized for buffered operation. INSERT... maybe not so
much.

But if the source data is already in a table, that would require piping the
data to stdout and then back into the database.

psql appdb -c "COPY (SELECT ...) TO STDOUT;" | psql appdb -c "COPY
some_table FROM STDOUT;". Use binary mode, so text conversion isn't
required.

Maybe that's faster, maybe not.

Additionally, when we were trying to create indexes post data load on one

partition, it took 30+ minutes. Any possible way to make it faster?

Is there any way to drive the above things in parallel by utilizing full
database resources?

Put the destination tables in a different tablespace on a different
controller.

It's postgres 15.4

Why not 15.8?

--
Death to America, and butter sauce.
Iraq lobster!

#3Jeff Ross
jross@openvistas.net
In reply to: Lok P (#1)
Re: Faster data load

On 9/5/24 14:14, Lok P wrote:

Hi,

We are having a requirement to create approx 50 billion rows in a
partition table(~1 billion rows per partition, 200+gb size daily
partitions) for a performance test. We are currently using ' insert
into <target table_partition> select.. From <source_table_partition>
or <some transformed query>;' method . We have dropped all indexes and
constraints First and then doing the load. Still it's taking 2-3 hours
to populate one partition. Is there a faster way to achieve this?

Few teammate suggesting to use copy command and use file load instead,
which will be faster. So I wanted to understand, how different things
it does behind the scenes as compared to insert as select command? As
because it only deals with sql engine only.

Additionally, when we were trying to create indexes post data load on
one partition, it took 30+ minutes. Any possible way to make it faster?

Is there any way to drive the above things in parallel by utilizing
full database resources?

It's postgres 15.4

Regards
Lok

Try pg_bulkload to load the data--takes a little set up but it is very
fast.  Do pay attention to the caveats.  For a performance test they
probably won't be relevant.

https://github.com/ossc-db/pg_bulkload?tab=readme-ov-file

Jeff

#4Muhammad Usman Khan
usman.k@bitnine.net
In reply to: Lok P (#1)
Re: Faster data load

Hi,

You can use pg_partman. If your table is partitioned, you can manage
partitions in parallel by distributing the load across partitions
concurrently. Or you can use citus. It can be an excellent solution,
especially for handling large data volumes and parallelizing data operations

On Fri, 6 Sept 2024 at 01:14, Lok P <loknath.73@gmail.com> wrote:

Show quoted text

Hi,

We are having a requirement to create approx 50 billion rows in a
partition table(~1 billion rows per partition, 200+gb size daily
partitions) for a performance test. We are currently using ' insert into
<target table_partition> select.. From <source_table_partition> or <some
transformed query>;' method . We have dropped all indexes and constraints
First and then doing the load. Still it's taking 2-3 hours to populate one
partition. Is there a faster way to achieve this?

Few teammate suggesting to use copy command and use file load instead,
which will be faster. So I wanted to understand, how different things it
does behind the scenes as compared to insert as select command? As because
it only deals with sql engine only.

Additionally, when we were trying to create indexes post data load on one
partition, it took 30+ minutes. Any possible way to make it faster?

Is there any way to drive the above things in parallel by utilizing full
database resources?

It's postgres 15.4

Regards
Lok

#5Lok P
loknath.73@gmail.com
In reply to: Muhammad Usman Khan (#4)
Re: Faster data load

On Fri, 6 Sept, 2024, 9:20 am Muhammad Usman Khan, <usman.k@bitnine.net>
wrote:

Hi,

You can use pg_partman. If your table is partitioned, you can manage
partitions in parallel by distributing the load across partitions
concurrently. Or you can use citus. It can be an excellent solution,
especially for handling large data volumes and parallelizing data operations

Thank you.
The tables are partitioned. Also during index creation we are trying to do
it multiple partitions at same time from multiple sessions.But seeing out
of memory error in 5th or 6th session. And even each sessions taking 30mins
per partitions for index creation. Attach index partitions happening in
seconds though.

Show quoted text
#6Ron
ronljohnsonjr@gmail.com
In reply to: Lok P (#5)
Re: Faster data load

On Fri, Sep 6, 2024 at 12:43 AM Lok P <loknath.73@gmail.com> wrote:

Also during index creation we are trying to do it multiple partitions at
same time from multiple sessions.But seeing out of memory error in 5th or
6th session.

Had that same problem during pg_restore. Reduced maintenance_work_mem and
the problem went away.

--
Death to America, and butter sauce.
Iraq lobster!

#7Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Lok P (#1)
Re: Faster data load

On 2024-09-06 01:44:00 +0530, Lok P wrote:

We are having a requirement to create approx 50 billion rows in a partition
table(~1 billion rows per partition, 200+gb size daily partitions) for a
performance test. We are currently using ' insert into <target table_partition>
select.. From <source_table_partition> or <some transformed query>;' method .
We have dropped all indexes and constraints First and then doing the load.
Still it's taking 2-3 hours to populate one partition.

That seems quite slow. Is the table very wide or does it have a large
number of indexes?

Is there a faster way to achieve this? 

Few teammate suggesting to use copy command and use file load instead, which
will be faster.

I doubt that.

I benchmarked several strategies for populating tables 5 years ago and
(for my test data and on our hardware at the time - YMMV) s simple
INSERT ... SELECT was more than twice as fast as 8 parallel COPY
operations (and about 8 times as fast as a single COPY).

Details will have changed since then (I should rerun that benchmark on
a current system), but I'd be surprised if COPY became that much faster
relative to INSERT ... SELECT.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#8Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Peter J. Holzer (#7)
Re: Faster data load

On 9/8/24 10:45, Peter J. Holzer wrote:

On 2024-09-06 01:44:00 +0530, Lok P wrote:

We are having a requirement to create approx 50 billion rows in a partition
table(~1 billion rows per partition, 200+gb size daily partitions) for a
performance test. We are currently using ' insert into <target table_partition>
select.. From <source_table_partition> or <some transformed query>;' method .
We have dropped all indexes and constraints First and then doing the load.
Still it's taking 2-3 hours to populate one partition.

That seems quite slow. Is the table very wide or does it have a large
number of indexes?

Is there a faster way to achieve this?

Few teammate suggesting to use copy command and use file load instead, which
will be faster.

I doubt that.

I benchmarked several strategies for populating tables 5 years ago and
(for my test data and on our hardware at the time - YMMV) s simple
INSERT ... SELECT was more than twice as fast as 8 parallel COPY
operations (and about 8 times as fast as a single COPY). >
Details will have changed since then (I should rerun that benchmark on
a current system), but I'd be surprised if COPY became that much faster
relative to INSERT ... SELECT.

Yeah they seem to have changed a great deal. Though you are correct in
saying COPY is not faster then INSERT..SELECT

select version();
version

-----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

select count(*) from nyc_taxi_duckdb ;
count
---------
2846722

ll -h nyc_taxi.csv
-rw-rw-r-- 1 aklaver aklaver 252M Sep 8 10:54 nyc_taxi.csv

insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 7015.072 ms (00:07.015)

truncate nyc_duplicate ;

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 8760.197 ms (00:08.760)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 7904.279 ms (00:07.904)

Just to see what the coming attraction offers:

select version();
version

--------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 17rc1 (Ubuntu 17~rc1-1.pgdg22.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 5315.878 ms (00:05.316)

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 10042.129 ms (00:10.042)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 8422.503 ms (00:08.423)

hp

--
Adrian Klaver
adrian.klaver@aklaver.com

#9Dominique Devienne
ddevienne@gmail.com
In reply to: Adrian Klaver (#8)
Re: Faster data load

On Sun, Sep 8, 2024 at 8:27 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

simple INSERT ... SELECT was more than twice as fast as 8 parallel COPY
operations (and about 8 times as fast as a single COPY).

Yeah they seem to have changed a great deal. Though you are correct in
saying COPY is not faster then INSERT..SELECT

Interesting. Thanks for sharing. Although to be fair, that adds CSV parsing
to the mix, something INSERT-FROM-SELECT does not have to do,
skewing the results a little maybe. Comparing against COPY BINARY
would be fairer, but less practical I guess. --DD