Multiple COPY on the same table

Started by Ravi Krishna · over 7 years ago · 10 messages · general
#1 Ravi Krishna
sravikrishna@aol.com

Can I split a large file into multiple files and then run COPY on each file? The table does not contain any serial or sequence column that would need serialization. Say I split a large file into 4 files: will the performance improve by close to 4x?
ps: Pls ignore my previous post, which was missing a subject (by mistake)
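A minimal sketch of the idea, assuming GNU split, a headerless CSV, and a hypothetical table named big_table (the psql step is commented out, since it needs a live database):

```shell
set -e
# Stand-in for the large vendor CSV (no header row).
seq 1 100000 > data.csv
# Split into 4 pieces, cutting only at line boundaries (GNU split's -n l/N).
split -n l/4 data.csv chunk_
# Each chunk could then be loaded by its own session in parallel, e.g.:
#   for f in chunk_*; do psql -d mydb -c "\copy big_table from '$f' csv" & done
#   wait
ls chunk_*
```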

#2 Nicolas Paris
nicolas.paris@riseup.net
In reply to: Ravi Krishna (#1)
Re: Multiple COPY on the same table

Can I split a large file into multiple files and then run copy using
each file.

AFAIK, the copy command locks the table[1], while there is no mention of this
in the documentation[2].

Will the performance boost by close to 4x??

You might be interested in the PgBulkInsert tool[3], which allows parallel
copy with some success according to its benchmarks. However, that tool does
not handle multiline CSV. Because of that limitation I have been using the
standard copy command with binary format, with some success.

[1]: https://grokbase.com/t/postgresql/pgsql-general/01597pv3qs/copy-locking
[2]: https://www.postgresql.org/docs/current/static/sql-copy.html
[3]: https://github.com/bytefish/PgBulkInsert

#3 Vijaykumar Jain
vjain@opentable.com
In reply to: Ravi Krishna (#1)
Re: [External] Multiple COPY on the same table

Hey Ravi,

What is the goal you are trying to achieve here?
To make pg_dump/restore faster?
To make replication faster?
To make backup faster?

Also, no matter how small you split the files, if the network is your bottleneck then I am not sure you can attain n times the benefit by simply sending the files in parallel, though there may be some benefit.
For parallel processing you also need to ensure your server has the relevant resources, or else it will just be a lot of context switching, I guess?
pg_dump has an option to dump in parallel.
pg_basebackup is single-threaded, I read, but pgBackRest allows better parallel processing in backups.
There is also logical replication, where you can selectively replicate your tables to avoid bandwidth issues.
I might have said a lot and nothing may be relevant, but you need to let us know the goal you want to achieve :)

Regards,
Vijay

#4 Ravi Krishna
sravikrishna@aol.com
In reply to: Vijaykumar Jain (#3)
Re: [External] Multiple COPY on the same table

What is the goal you are trying to achieve here?
To make pg_dump/restore faster?
To make replication faster?
To make backup faster?

None of the above.
We got CSV files from an external vendor, 880GB in total across 44 files. Some of the large tables had COPY running for several hours. I was just wondering whether there is a faster way to load.

#5 Chris Browne
cbbrowne@acm.org
In reply to: Ravi Krishna (#4)
Re: [External] Multiple COPY on the same table

On Mon, 20 Aug 2018 at 12:53, Ravi Krishna <sravikrishna@aol.com> wrote:

What is the goal you are trying to achieve here?
To make pg_dump/restore faster?
To make replication faster?
To make backup faster?

None of the above.

We got CSV files from an external vendor, 880GB in total across 44 files. Some of the large tables had COPY running for several hours. I was just wondering whether there is a faster way to load.

Seems like #4...

#4 - To Make Recovery faster

Using COPY pretty much *is* the "faster way to load"...

The main thing you should consider doing to make it faster is to drop
indexes and foreign keys from the tables, and recreate them
afterwards.
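A hedged sketch of that pattern, with placeholder names (big_table, big_table_id_idx, big_table_other_fk); the script is only written out here, not run against a database:

```shell
# Generate the load script: drop the index and FK, COPY, then rebuild both.
cat > load.sql <<'SQL'
DROP INDEX IF EXISTS big_table_id_idx;
ALTER TABLE big_table DROP CONSTRAINT IF EXISTS big_table_other_fk;
\copy big_table from 'data.csv' csv
CREATE INDEX big_table_id_idx ON big_table (id);
ALTER TABLE big_table ADD CONSTRAINT big_table_other_fk
  FOREIGN KEY (other_id) REFERENCES other_table (id);
SQL
# psql -d mydb -f load.sql   # run it against a real database
```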
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#6 Vijaykumar Jain
vjain@opentable.com
In reply to: Chris Browne (#5)
Re: [External] Multiple COPY on the same table

I guess this should help you, Ravi.

https://www.postgresql.org/docs/10/static/populate.html
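For example, a few of the tricks from that page combined; the table name and settings are illustrative only, and the script is just written to a file here rather than executed:

```shell
cat > bulkload.sql <<'SQL'
-- A larger maintenance_work_mem speeds up the CREATE INDEX afterwards.
SET maintenance_work_mem = '1GB';
BEGIN;
CREATE TABLE big_table (id int, payload text);
-- With wal_level = minimal, COPY into a table created in the same
-- transaction can skip WAL logging entirely.
\copy big_table from 'data.csv' csv
COMMIT;
CREATE INDEX ON big_table (id);
ANALYZE big_table;
SQL
# psql -d mydb -f bulkload.sql
```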


#7 Ron
ronljohnsonjr@gmail.com
In reply to: Vijaykumar Jain (#3)
Re: [External] Multiple COPY on the same table

Maybe he just has a large file that needs to be loaded into a table...


--
Angular momentum makes the world go 'round.

#8 Ravi Krishna
sravikrishna@aol.com
In reply to: Ron (#7)
Re: [External] Multiple COPY on the same table

1. The tables have no indexes at the time of load.
2. The CREATE TABLE and COPY are in the same transaction.

So I guess that's pretty much it. I understand the long time it takes, as some of the tables have 400+ million rows. Also, the env is a container, and since this is currently a POC system not much time has been invested in fine-tuning it.

Thanks all.

#9 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Nicolas Paris (#2)
Re: Multiple COPY on the same table

On 08/20/2018 08:56 AM, Nicolas Paris wrote:

Can I split a large file into multiple files and then run copy using
each file.

AFAIK, copy command locks the table[1] while there is no mention of this
in the documentation[2].

[1]: Is from Postgres 7.1 (17 years ago). I suspect the conditions have changed at least a little :).

Will the performance boost by close to 4x??

You might be interested in the PgBulkInsert tool[3], which allows parallel
copy with some success according to its benchmarks. However, that tool does
not handle multiline CSV. Because of that limitation I have been using the
standard copy command with binary format, with some success.

[1] https://grokbase.com/t/postgresql/pgsql-general/01597pv3qs/copy-locking
[2] https://www.postgresql.org/docs/current/static/sql-copy.html
[3] https://github.com/bytefish/PgBulkInsert

--
Adrian Klaver
adrian.klaver@aklaver.com

#10 Chris Browne
cbbrowne@acm.org
In reply to: Adrian Klaver (#9)
Re: Multiple COPY on the same table

On Mon, 20 Aug 2018 at 16:23, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 08/20/2018 08:56 AM, Nicolas Paris wrote:

Can I split a large file into multiple files and then run copy using
each file.

AFAIK, copy command locks the table[1] while there is no mention of this
in the documentation[2].

[1] Is from Postgres 7.1 (17 years ago). I suspect the conditions have
changed at least a little :).

oxrsdb-generated@localhost-> create temp table foo ();
CREATE TABLE
oxrsdb-generated@localhost-> begin;
BEGIN
oxrsdb-generated@localhost-> * \copy foo from '/dev/null';
COPY 0
oxrsdb-generated@localhost-> * select oid, relname from pg_class where relname = 'foo';
oid | relname
--------+---------
350686 | foo
(1 row)

oxrsdb-generated@localhost-> * select * from pg_locks ;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |       mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------+----------
 relation   |   345742 |    11713 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     3455 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2663 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2662 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2685 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2684 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     2615 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |     1259 |      |       |            |               |         |       |          | 2/200573           | 16754 | AccessShareLock  | t       | t
 relation   |   345742 |   350686 |      |       |            |               |         |       |          | 2/200573           | 16754 | RowExclusiveLock | t       | t
 virtualxid |          |          |      |       | 2/200573   |               |         |       |          | 2/200573           | 16754 | ExclusiveLock    | t       | t
(10 rows)

Table 'foo' has a RowExclusiveLock lock taken out as a consequence of
running COPY against it.

But that does not prevent other connections from concurrently writing
to the table.

Not all locks block other locks...
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"