importing a messy text file

Started by Willy-Bas Loos on 30 April 2014 · 11 messages · pgsql-general
#1 Willy-Bas Loos
willybas@gmail.com

Hi,

I have a 56GB text file that I want to import into Postgres.
The file is tab delimited and not quoted.
I deleted the header with the column names (using sed) so that I could use
COPY with the non-CSV text format (because some of the text values contain
quotes).

I had some minor trouble with the file which I managed, but now I have a
problem where I can't think of a solution, even though it seems so simple.

The problem is this:
There is a tab after the last column in many, but not all, records.
When I ran into the extra tab I added a dummy column to the destination
table, but now COPY throws an error because the data for the dummy column is
missing (on record ~275K of about 150M).

The file is too big to edit by hand, and it would probably not be
feasible to manually add tabs to every record that is missing one, although I
don't know how many that would be.

I realize that there could be other showstoppers in the file, like missing
or extra tabs in the middle, but I would like to try and get this fixed.

Maybe it would be feasible to load every record as one value and then
split it into columns using Postgres text processing.
Or maybe there is an (undocumented?) option in COPY or \copy to ignore
extra columns.
Or maybe there is some NoSQL software where I can import this and then
structure the data before passing it to Postgres.

Do you have any tips, please?

Cheers,

--
Willy-Bas Loos

#2 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Willy-Bas Loos (#1)
Re: importing a messy text file

On Wed, Apr 30, 2014 at 10:07:09AM +0200, Willy-Bas Loos wrote:

it would probably not be
feasible to manually add tabs for every record that misses one, although i
don't know how many it would be.

Why not a) let a script do that, or b) remove tabs from the end
of lines as needed?

All in all it sounds like you want to use a scripting
language to sanitize the file.
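As a minimal sketch of such a sanitizing script (the column count, file names, and sample data are made up for illustration), an awk one-liner can strip a single trailing tab and flag every other mismatch:

```shell
# Normalize lines to exactly $expected tab-separated fields: strip one
# trailing empty field when present, report any other mismatch to stderr.
expected=3
printf 'a\tb\tc\t\nd\te\tf\nx\ty\n' > messy.txt   # line 1: extra tab; line 3: short
awk -F'\t' -v n="$expected" '
    NF == n + 1 && $(n + 1) == "" { $0 = substr($0, 1, length($0) - 1) }  # drop trailing tab
    NF != n { print "line " NR ": " NF " fields" > "/dev/stderr"; next }  # anything else: skip
    { print }
' messy.txt > clean.txt
cat clean.txt
```

Rejected lines go to stderr with their line number, so the bad records can be inspected separately instead of silently disappearing.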

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3 Alberto Cabello Sánchez
alberto@unex.es
In reply to: Willy-Bas Loos (#1)
Re: importing a messy text file

On Wed, 30 Apr 2014 10:07:09 +0200
Willy-Bas Loos <willybas@gmail.com> wrote:

Hi,

I have a 56GB textfile that i want to import into postgres.

!!!

The problem is this:
There is a tab after the last column, in many but not all records.

You could clean it up with something like s/^I$// (where ^I is a literal tab), couldn't you?
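Spelled out as a sketch with made-up sample data (^I is how a literal tab displays; GNU sed also accepts \t in the pattern), the idea is this. Note that it cannot tell an extra tab from a legitimately empty last field:

```shell
# Delete one tab at end-of-line, if any (GNU sed understands \t).
printf 'one\ttwo\t\nthree\tfour\n' > sample.txt
sed 's/\t$//' sample.txt > fixed.txt
cat fixed.txt
```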

--
Alberto Cabello Sánchez
<alberto@unex.es>

#4 Willy-Bas Loos
willybas@gmail.com
In reply to: Karsten Hilbert (#2)
Re: importing a messy text file

Alberto, it would be hard to use sed (s/^I$//), because there is no
explicit NULL value and there are many NULLs in the last column.
So I can't be sure how many tabs should be at the end of each line.

Yes, Karsten, maybe scripting would be the easiest way to fix this; I would
then probably insert it line by line. That's a possibility. It might
take a long time to run, but that's not much of an issue.

--
Willy-Bas Loos

#5 Alberto Cabello Sánchez
alberto@unex.es
In reply to: Willy-Bas Loos (#4)
Re: importing a messy text file

On Wed, 30 Apr 2014 10:47:12 +0200
Willy-Bas Loos <willybas@gmail.com> wrote:

Alberto, it would be hard to use sed( s/^I$// ), because there is no
explicit NULL value and there are many NULLs in the last column.
So i can't be sure how many tabs should be in the end of each line.

Ok, I understand (I think).

What about using "cut" to strip the extra fields?
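For instance (the three-column width and file name are made up), cut keeps only the requested fields, and the extra trailing tab disappears along with the empty field it delimited. One caveat: lines containing no delimiter at all are passed through unchanged unless -s is given.

```shell
# Keep fields 1-3; an extra trailing (empty) field is simply dropped.
# cut uses TAB as its default field delimiter.
printf 'a\tb\tc\t\nd\te\tf\n' > sample.txt
cut -f1-3 sample.txt > trimmed.txt
cat trimmed.txt
```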

--
Alberto Cabello Sánchez
<alberto@unex.es>

#6 Rémi Cura
remi.cura@gmail.com
In reply to: Willy-Bas Loos (#4)
Re: importing a messy text file

Hey,
with the latest version, 9.3, you can use COPY FROM with a program as the argument.
I had a similar problem (an extra space at the end of each line), so I used
sed in a pipeline to remove it before feeding it to the database.

You can use the Unix pipe (|) for most of this.

If you have an older version, you can use the mkfifo command to "trick"
Postgres into believing it will copy from a file.

Cheers,
Rémi-C

2014-04-30 10:47 GMT+02:00 Willy-Bas Loos <willybas@gmail.com>:

Alberto, it would be hard to use sed( s/^I$// ), because there is no
explicit NULL value and there are many NULLs in the last column.
So i can't be sure how many tabs should be in the end of each line.

Yes, Karsten, maybe scripting would be the easiest way to fix this, i
would then probably insert this line for line. That's a possibility. It
might take a long time to run, but that's not much of an issue.

--
Willy-Bas Loos

#7 Willy-Bas Loos
willybas@gmail.com
In reply to: Alberto Cabello Sánchez (#5)
Re: importing a messy text file

On Wed, Apr 30, 2014 at 11:03 AM, Alberto Cabello Sánchez
<alberto@unex.es> wrote:

What about using "cut" to strip the extra fields?

Wow, I didn't know "cut"; this looks promising.

Thanks.

--
Willy-Bas Loos

#8 Willy-Bas Loos
willybas@gmail.com
In reply to: Rémi Cura (#6)
Re: importing a messy text file

On Wed, Apr 30, 2014 at 11:06 AM, Rémi Cura <remi.cura@gmail.com> wrote:

with latest version 9.3 you can use a copy from with a programm as
argument.
You can use the unix | pipe for most of the stuff.

If you have an older version you can use the mkfifo command to "trick"
postgres into believing he will copy from a file.

Hi, you're right, I can see the new feature in the docs.
But since I am working from the (bash) command line already, I don't see
much use for this.
I already pass the data like this:
cat <file> | psql -c "copy <table> from stdin NULL ''" <db> > <file>.log 2>&1

It's especially handy if you want to stay in SQL (in a transaction), I guess.

Cheers,

--
Willy-Bas Loos

#9 Rob Sargent
robjsargent@gmail.com
In reply to: Willy-Bas Loos (#8)
Re: importing a messy text file

On 04/30/2014 03:50 AM, Willy-Bas Loos wrote:

On Wed, Apr 30, 2014 at 11:06 AM, Rémi Cura <remi.cura@gmail.com> wrote:

with latest version 9.3 you can use a copy from with a programm as
argument.
You can use the unix | pipe for most of the stuff.

If you have an older version you can use the mkfifo command to
"trick" postgres into believing he will copy from a file.

Hi, you're right i can see the new feature in the docs.
But since i am working from the (bash) command line already, i don't
see any use for this.
I already pass the data like this:
cat <file> |psql -c "copy <table> from stdin NULL ''" <db> >
<file>.log 2>&1

It's especially handy if you want to stay in sql(transaction), i gues.

Cheers,

--
Willy-Bas Loos

Unless you know this is the only problem (an extra/missing tab), I would
triage the file with a scripting language. One favourite check is

awk -F'\t' '{a[NF]++} END {print "Field count : Record count"; for (i in a) printf "%11d : %d\n", i, a[i]}' <file>

If you only have the tab problem you will get exactly two lines of output,
with field counts one apart. Take care in setting the field separator.
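Run on a tiny made-up sample, that field-count census looks like this (sort makes the bucket order deterministic, since awk's for-in order is unspecified):

```shell
# Count how many lines have each field count; with only the trailing-tab
# defect there should be exactly two buckets, one field apart.
printf 'a\tb\tc\t\nd\te\tf\ng\th\ti\t\n' > sample.txt
awk -F'\t' '{a[NF]++} END {for (i in a) printf "%d fields: %d lines\n", i, a[i]}' sample.txt \
    | sort > census.txt
cat census.txt
```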

#10 bricklen
bricklen@gmail.com
In reply to: Willy-Bas Loos (#1)
Re: importing a messy text file

On Wed, Apr 30, 2014 at 1:07 AM, Willy-Bas Loos <willybas@gmail.com> wrote:

Hi,

I have a 56GB textfile that i want to import into postgres.
The file is tab delimited and not quoted.

Would Pgloader be an option? http://tapoueh.org/pgloader/

#11 Michael Paquier
michael@paquier.xyz
In reply to: bricklen (#10)
Re: importing a messy text file

On Thu, May 1, 2014 at 1:30 AM, bricklen <bricklen@gmail.com> wrote:

On Wed, Apr 30, 2014 at 1:07 AM, Willy-Bas Loos <willybas@gmail.com> wrote:

Hi,

I have a 56GB textfile that i want to import into postgres.
The file is tab delimited and not quoted.

Would Pgloader be an option? http://tapoueh.org/pgloader/

Or pg_bulkload? Unlike the in-core COPY, it can filter
"incorrect" tuple data at the line level.
--
Michael
