CSV-bulk import and defaults

Started by Thomas Schmidt · over 15 years ago · 5 messages · general
#1Thomas Schmidt
postgres@stephan.homeunix.net

Hello,

well, I'm new to postgres and this is my first post on this list :-)
Anyway, I have to batch-import bulk CSV data into a staging database (as
part of an ETL-like process). The data ought to be read via STDIN;
however, to keep it simple and stupid, saving it to a file and
importing afterwards is also an option. Sticking my nose into the docs,
I noticed that COPY[1] as well as pg_bulkload[2] are able to do it.

However, there are some additional columns in the staging table (job id,
etc.) that have to be set in order to identify imported rows. These
attributes are not part of the data coming from STDIN (since they are
meta-data), and I see no way of specifying default values for "missing"
CSV columns. (IMHO, COPY and pg_bulkload will use table defaults for
missing columns - am I missing something?)
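For illustration, COPY with an explicit column list does fall back to the
table defaults for any unlisted columns - but those defaults are static, not
per-import constants (the table and column names below are hypothetical):

```sql
-- Hypothetical staging table; job_id is meta-data not present in the CSV.
CREATE TABLE staging (
    job_id  integer DEFAULT 0,
    payload text
);

-- Only the CSV column is listed, so job_id falls back to its DEFAULT
-- for every imported row.
COPY staging (payload) FROM STDIN WITH (FORMAT csv);
```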

Thus - do you have any clue on designing a fast bulk import for staging
data?

Thanks in advance,
Thomas

[1]: http://www.postgresql.org/docs/9.0/static/sql-copy.html
[2]: http://pgbulkload.projects.postgresql.org/pg_bulkload.html

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Thomas Schmidt (#1)
Re: CSV-bulk import and defaults

On Sunday 02 January 2011 2:22:14 pm Thomas Schmidt wrote:

Hello,

well, I'm new to postgres and this is my first post on this list :-)
Anyway, I have to batch-import bulk CSV data into a staging database (as
part of an ETL-like process). The data ought to be read via STDIN;
however, to keep it simple and stupid, saving it to a file and
importing afterwards is also an option. Sticking my nose into the docs,
I noticed that COPY[1] as well as pg_bulkload[2] are able to do it.

However, there are some additional columns in the staging table (job id,
etc.) that have to be set in order to identify imported rows. These
attributes are not part of the data coming from STDIN (since they are
meta-data), and I see no way of specifying default values for "missing"
CSV columns. (IMHO, COPY and pg_bulkload will use table defaults for
missing columns - am I missing something?)

Thus - do you have any clue on designing a fast bulk import for staging
data?

Thanks in advance,
Thomas

[1] http://www.postgresql.org/docs/9.0/static/sql-copy.html
[2] http://pgbulkload.projects.postgresql.org/pg_bulkload.html

Check out pgloader:
http://pgloader.projects.postgresql.org/

In particular:

"user_defined_columns

Those are special columns not found in the data file but which you want to
load into the database. The configuration options beginning with udc_ are taken
as column names with constant values. The following example defines the column c
as having the value "constant value" for each and every row of the input data
file.

udc_c = constant value

The option copy_columns is used to define the exact columns list given to
COPY.

A simple use case is loading data coming from more than one file into the
same database table. If you need to keep track of the data origin, add a
column to the table model and define a udc_ for pgloader to add a constant
value in the database.

Using user-defined columns requires defining copy_columns and is not
compatible with only_cols usage.

copy_columns

This option defines the columns to load from the input data file and the
user-defined columns, and in which order to do this. Place here the column
names separated by commas.

copy_columns = b, c, d

This option is required if any user column is defined, and conflicts with
the only_cols option. It has no effect when used in a section where no
user column is defined.
"
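Putting those two options together, a pgloader section for this use case
might look roughly like the following sketch - the section, table, file, and
column names are made up, so check the pgloader docs for the exact option set
of your version:

```
[staging_import]
table        = staging
filename     = import.csv
format       = csv
field_sep    = ,

; constant meta-data column, not present in the CSV file
udc_job_id   = 42

; CSV columns plus the user-defined column, in the order COPY should see them
copy_columns = payload, created, job_id
```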

--
Adrian Klaver
adrian.klaver@gmail.com

#3Thomas Schmidt
postgres@stephan.homeunix.net
In reply to: Adrian Klaver (#2)
pgloader and Indexes / was: Re: CSV-bulk import and defaults

Hello,

On 03.01.11 00:06, Adrian Klaver wrote:

On Sunday 02 January 2011 2:22:14 pm Thomas Schmidt wrote:

well, I'm new to postgres and this is my first post on this list :-)
Anyway, I have to batch-import bulk CSV data into a staging database (as
part of an ETL-like process). The data ought to be read via STDIN;
however, to keep it simple and stupid, saving it to a file and
importing afterwards is also an option. Sticking my nose into the docs,
I noticed that COPY[1] as well as pg_bulkload[2] are able to do it.

However, there are some additional columns in the staging table (job id,
etc.) that have to be set in order to identify imported rows. These
attributes are not part of the data coming from STDIN (since they are
meta-data), and I see no way of specifying default values for "missing"
CSV columns. (IMHO, COPY and pg_bulkload will use table defaults for
missing columns - am I missing something?)

[1] http://www.postgresql.org/docs/9.0/static/sql-copy.html
[2] http://pgbulkload.projects.postgresql.org/pg_bulkload.html

Check out pgloader:
http://pgloader.projects.postgresql.org/

Thanks a lot - that's what I need. :-)
Btw, what about indexes?
http://www.postgresql.org/docs/9.0/interactive/populate.html suggests
removing indexes before importing via COPY (for obvious reasons).
Does pgloader take indexes into account, or do I need to handle them manually?

Thanks in advance,
Thomas

#4Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Thomas Schmidt (#3)
Re: pgloader and Indexes / was: Re: CSV-bulk import and defaults

On Monday 03 January 2011 12:48:22 am Thomas Schmidt wrote:

Thanks a lot - that's what I need. :-)
Btw, what about indexes?
http://www.postgresql.org/docs/9.0/interactive/populate.html suggests
removing indexes before importing via COPY (for obvious reasons).
Does pgloader take indexes into account, or do I need to handle them
manually?

Thanks in advance,
Thomas

I don't know. I have used it to load data into holding tables that have no
indexes. My guess is you will have to handle them manually.
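If it comes to that, the pattern the populate docs suggest can be scripted
around the load; a minimal sketch, with a hypothetical index name:

```sql
-- Drop the index before the bulk load ...
DROP INDEX IF EXISTS staging_payload_idx;

-- ... run the COPY / pgloader import here ...

-- ... and rebuild it once the data is in.
CREATE INDEX staging_payload_idx ON staging (payload);
```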

--
Adrian Klaver
adrian.klaver@gmail.com

#5Andrej Ricnik-Bay
andrej.groups@gmail.com
In reply to: Thomas Schmidt (#1)
Re: CSV-bulk import and defaults

On 3 January 2011 11:22, Thomas Schmidt <postgres@stephan.homeunix.net> wrote:

Thus - do you have any clue on designing an fast bulk-import for staging
data?

As you're talking about STDIN ... have you considered
piping the input data through awk or sed to pre-populate
the empty meta-data field? Easy enough, with
low CPU overhead.
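For instance, a minimal awk filter along those lines - the job id, column
layout, and table name are made up, and it naively assumes no embedded
commas or quotes in the data:

```shell
# Prepend a constant job_id field to every CSV row read from stdin.
printf '%s\n' 'alice,2011-01-02' 'bob,2011-01-03' \
  | awk -v job=42 'BEGIN { OFS = "," } { print job, $0 }'
# prints:
#   42,alice,2011-01-02
#   42,bob,2011-01-03
#
# The filtered stream could then be fed straight to COPY, e.g.:
#   ... | psql -c "COPY staging (job_id, name, created) FROM STDIN WITH (FORMAT csv)"
```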


Cheers,
Andrej

--
Please don't top post, and don't use HTML e-Mail :}  Make your quotes concise.

http://www.georgedillon.com/web/html_email_is_evil.shtml