importing db as text files

Started by expect · over 22 years ago · 38 messages · general
#1expect
expect@ihubbell.com

What's the big deal with importing text files? I have a 70 MB file to import
and it's been one problem after another. I used the copy command and it appears
that it's just not possible. I finally massaged the file into a .sql file and
ran that using \i db.sql but that failed too because I overlooked ' in names
like D'Adario. The other problem I encountered was that a numeric field had
to have data in it, pg would not default to the default value. So instead of
massaging all the data again I decided to change the data type for that column.
This is my first experience with postgresql and I'm wondering if I should expect
to encounter similar pain as I go further into this? So far it's been very
painful trying to do what I thought would be easy and what I think should be
easy.

PostgreSQL 7.3.4 on Linux (Red Hat 9)

#2Dann Corbit
DCorbit@connx.com
In reply to: expect (#1)
Re: importing db as text files


I use the copy command all the time without problems. If the data
follows the format that copy expects, there will be no problems. If
the data is in some other format, then you will have to use perl or sed or
something to reformat it.
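
For reference, the format copy expects by default is one record per line, columns
separated by a tab character, and \N for a null column; a single quote in the data
needs no escaping there. A rough sketch (the table and rows here are made up):

create table people (name text, balance numeric default 0);
copy people from stdin;
D'Adario	42.50
Smith	\N
\.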

If your data is in some format that is not easy to massage into
something that copy wants to eat, then use an ODBC driver for text files
and move all the data with insert/select.

Expect no miracles. GI/GO (garbage in, garbage out).

#3Bruno Wolff III
bruno@wolff.to
In reply to: expect (#1)
Re: importing db as text files

On Wed, Aug 13, 2003 at 14:14:20 -0700,
expect <expect@ihubbell.com> wrote:

What's the big deal with importing text files? I have a 70 MB file to import
and it's been one problem after another. I used the copy command and it appears
that it's just not possible. I finally massaged the file into a .sql file and

That is unlikely, but without more details it is hard to say what you need
to do.

ran that using \i db.sql but that failed too because I overlooked ' in names
like D'Adario. The other problem I encountered was that a numeric field had
to have data in it, pg would not default to the default value. So instead of

You can use the keyword default in insert statements to get a default value.
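
For example (just a sketch with a made-up table):

insert into mytable (name, qty) values ('D''Adario', default);
-- qty gets the column's default; the doubled '' escapes the quote inside a string literal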

massaging all the data again I decided to change the data type for that column.
This is my first experience with postgresql and I'm wondering if I should expect
to encounter similar pain as I go further into this? So far it's been very
painful trying to do what I thought would be easy and what I think should be
easy.

The impression I get is that you expect postgres to make a best guess when
presented with ambiguous data. That is a very dangerous thing to do. I would
much prefer ambiguous data be rejected so that I can make sure what I think
the value is, is the same as what the database thinks the value is.

#4Jason Godden
jasongodden@optushome.com.au
In reply to: expect (#1)
Re: importing db as text files

Hi expect,

Best way in my opinion is to use the copy table command. This way Pg will
actually 'massage' the data (string escapes and all) for you.

If you have complex files it is best to use a combination of sed/awk to pre-massage
the field delimiters and import the data into a temporary table. In most instances
it is best to pipe the data to the psql command using copy table from stdin,
since copy from a file requires superuser rights and a file readable by the server process.

Always use a temporary import table and perform validation/further conversion
in that table (IMO).

eg (from one of my earlier posts)

cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g"
| psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

The first sed will replace all instances of "," with a tab character
(I achieve tab in my shell scripts with a line like TAB=$(printf "\t"))

The second sed will remove the start and end " because the first only checks
for ",".

The third sed will remove the Windows carriage returns, if any (it strips the last
character of every line, so only use it when the file really has CRLF line endings).

Note here that my file format never contains any " in a field so I can safely
run the second sed. If your import file contains these then you will need to
change the regex or use awk.

Regards,

Jason


#5Gregory S. Williamson
gsw@globexplorer.com
In reply to: Jason Godden (#4)
Re: importing db as text files

I tend to use perl to preprocess dumps (in our case from Informix). It tends to work much faster than shell scripts (although your mileage may vary). I have to fix missing numeric values (if the column allows nulls why can't the copy command accept an empty field, I wonder?) and missing dates. In other cases it massages date formats, spatial data, etc. For example, a crude program below to clean DOQQ metadata:

Greg W.
==================================
firenze% more infxunl2psql
#!/usr/dist/bin/perl -w

$FILE = $ARGV[0];
$OUTPUT = $ARGV[1];
$MODE = $ARGV[2];

open (INFILE,"$FILE");
open (OUTFILE,">$OUTPUT");

foreach $line (<INFILE>)
{
chop($line);
chop($line);
if (($MODE cmp "DOQ") == 0) {
($t_source_filename, $t_quadrangle_name, $t_west_longitude, $t_east_longitude, $t_north_latitude, $t_south_latitude, $t_production_date, $t_raster_order, $t_band_organization, $t_band_content, $t_bits_per_pixel, $t_samples_and_lines, $t_horizontal_datum, $t_horizontal_coordinate_system, $t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution, $t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin, $t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy, $t_sw_quad_corner_xy, $t_secondary_nw_quad_xy, $t_secondary_ne_quad_xy, $tsecondary_se_quad_xy, $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source, $t_source_dem_date, $t_agency, $t_producer, $t_production_system, $t_standard_version, $t_metadata_date, $t_data_file_size, $byte_count) = split(/\|/,$line);
if (length($t_production_date) == 0) {
$t_production_date = "\\N"; # psql seems to dump a blank data with this nomenclature
}
if (length($t_coordinate_zone) == 0) { # an integer
$t_coordinate_zone = 0;
}
if (length($t_band_content) == 0) {
$t_band_content = 0; # also an int
}
if (length($t_bits_per_pixel) == 0) {
$t_bits_per_pixel = 0; # reasonable default for an int ?
}
if (length($t_horizontal_resolution) == 0) {
$t_horizontal_resolution = 0.0;
}
if (length($t_secondary_horizontal_datum) == 0) {
$t_secondary_horizontal_datum = "\'\'";
}
if (length($t_rmse_xy) == 0) {
$t_rmse_xy = 0.0;
}
if (length($t_metadata_date) == 0) {
$t_metadata_date = "\\N";
}
if (length($t_data_file_size) == 0) {
$t_data_file_size = 0; # a big int
}
if (length($byte_count) == 0) {
$byte_count = 0; # reasonable default ? for an int
}
$out_line = $t_source_filename . "|" . $t_quadrangle_name . "|" . $t_west_longitude . "|" .
    $t_east_longitude . "|" . $t_north_latitude . "|" . $t_south_latitude . "|" .
    $t_production_date . "|" . $t_raster_order . "|" . $t_band_organization . "|" .
    $t_band_content . "|" . $t_bits_per_pixel . "|" . $t_samples_and_lines . "|" .
    $t_horizontal_datum . "|" . $t_horizontal_coordinate_system . "|" . $t_coordinate_zone . "|" .
    $t_horizontal_units . "|" . $t_horizontal_resolution . "|" . $t_secondary_horizontal_datum . "|" .
    $t_xy_origin . "|" . $t_secondary_xy_origin . "|" . $t_nw_quad_corner_xy . "|" .
    $t_ne_quad_corner_xy . "|" . $t_se_quad_corner_xy . "|" . $t_sw_quad_corner_xy . "|" .
    $t_secondary_nw_quad_xy . "|" . $t_secondary_ne_quad_xy . "|" . $tsecondary_se_quad_xy . "|" .
    $t_secondary_sw_quad_xy . "|" . $t_rmse_xy . "|" . $t_image_source . "|" .
    $t_source_dem_date . "|" . $t_agency . "|" . $t_producer . "|" . $t_production_system . "|" .
    $t_standard_version . "|" . $t_metadata_date . "|" . $t_data_file_size . "|" . $byte_count;
print OUTFILE "$out_line\n";
}
else {
print OUTFILE "$line\n";
}
}
close INFILE;
close OUTFILE;


#6expect
expect@ihubbell.com
In reply to: Bruno Wolff III (#3)
Re: importing db as text files

On Wed, 13 Aug 2003 16:30:04 -0500
Bruno Wolff III <bruno@wolff.to> wrote:

On Wed, Aug 13, 2003 at 14:14:20 -0700,
expect <expect@ihubbell.com> wrote:

What's the big deal with importing text files? I have a 70 MB file to import
and it's been one problem after another. I used the copy command and it appears
that it's just not possible. I finally massaged the file into a .sql file and

That is unlikely, but without more details it is hard to say what you need
to do.

What's unlikely? That it didn't work? But it didn't work.

ran that using \i db.sql but that failed too because I overlooked ' in names
like D'Adario. The other problem I encountered was that a numeric field had
to have data in it, pg would not default to the default value. So instead of

You can use the keyword default in insert statements to get a default value.

From what I've read it should use the default value when there is no value.
Is that not true? Is this a known issue?

massaging all the data again I decided to change the data type for that column.
This is my first experience with postgresql and I'm wondering if I should expect
to encounter similar pain as I go further into this? So far it's been very
painful trying to do what I thought would be easy and what I think should be
easy.

The impression I get is that you expect postgres to make a best guess when
presented with ambiguous data. That is a very dangerous thing to do. I would
much prefer ambiguous data be rejected so that I can make sure what I think
the value is, is the same as what the database thinks the value is.

No you've come to the wrong impression. I believe that the problem lies in the
fact that pg will not default to the default value when no value is present.

#7expect
expect@ihubbell.com
In reply to: Dann Corbit (#2)
Re: importing db as text files

On Wed, 13 Aug 2003 14:24:30 -0700
"Dann Corbit" <DCorbit@connx.com> wrote:


I use the copy command all the time without problems. If the data
follows the format you describe for copy, there will be no problems. If
the data is some other format, then you will have to use perl or sed or
something to reformat it.

I believe the problem is that pg is unable to use the default value
when a value is not present.

If your data is in some format that is not easy to massage into
something that copy wants to eat, then use an ODBC driver for text files
and move all the data with insert/select.

Expect no miracles. GI/GO.

Importing a text file is a miracle? You're scaring me. ;^)


#8expect
expect@ihubbell.com
In reply to: Jason Godden (#4)
Re: importing db as text files

On Thu, 14 Aug 2003 07:34:55 +1000
Jason Godden <jasongodden@optushome.com.au> wrote:

Hi expect,

Best way in my opinion is to use the copy table command. This way Pg will
actually 'massage' the data (string escapes and all) for you.

I guess we're of the same opinion. I did use the copy table command.
I believe the problem is that pg is unable to use the default value when a
value is not present.


#9expect
expect@ihubbell.com
In reply to: Gregory S. Williamson (#5)
Re: importing db as text files

On Wed, 13 Aug 2003 14:59:29 -0700
"Gregory S. Williamson" <gsw@globexplorer.com> wrote:

I tend to use perl to preprocess dumps (in our case from Informix). It tends to work much faster than shell scripts (although your mileage may vary). I have to fix missing numeric values (if the column allows nulls why can't the copy command accept an empty field, I wonder?), missing dates. In other cases is massages date formats, spatial data, etc. For example, a crude program below to clean DOQQ metadata:

I'm not having any problems massaging the data. I believe the problem is that
pg is unable to use the default value when a value is not present.

BTW FWIW I took a few liberties with your perl code below for ease of reading.

For $out_line you can use the little append operator like:

$out_line = $t_source_filename . "|";
$out_line .= $t_quadrangle_name . "|";
etc., etc.

Greg W.
==================================
firenze% more infxunl2psql
#!/usr/dist/bin/perl -w

$FILE = $ARGV[0];
$OUTPUT = $ARGV[1];
$MODE = $ARGV[2];

open (INFILE,"$FILE");
open (OUTFILE,">$OUTPUT");

foreach $line (<INFILE>)
{
chop($line);
chop($line);
if (($MODE cmp "DOQ") == 0)

{
($t_source_filename, $t_quadrangle_name, $t_west_longitude,
$t_east_longitude, $t_north_latitude, $t_south_latitude,
$t_production_date, $t_raster_order, $t_band_organization,
$t_band_content, $t_bits_per_pixel, $t_samples_and_lines,
$t_horizontal_datum, $t_horizontal_coordinate_system,
$t_coordinate_zone, $t_horizontal_units, $t_horizontal_resolution,
$t_secondary_horizontal_datum, $t_xy_origin, $t_secondary_xy_origin,
$t_nw_quad_corner_xy, $t_ne_quad_corner_xy, $t_se_quad_corner_xy,
$t_sw_quad_corner_xy, $t_secondary_nw_quad_xy, $t_secondary_ne_quad_xy,
$tsecondary_se_quad_xy, $t_secondary_sw_quad_xy, $t_rmse_xy, $t_image_source,
$t_source_dem_date, $t_agency, $t_producer, $t_production_system,
$t_standard_version, $t_metadata_date, $t_data_file_size, $byte_count)
= split(/\|/,$line);

[snip]

#10expect
expect@ihubbell.com
In reply to: Gregory S. Williamson (#5)
Re: importing db as text files

On Wed, 13 Aug 2003 14:59:29 -0700
"Gregory S. Williamson" <gsw@globexplorer.com> wrote:

I tend to use perl to preprocess dumps (in our case from Informix). It tends
to work much faster than shell scripts (although your mileage may vary). I

have to fix missing numeric values (if the column allows nulls why can't the
copy command accept an empty field, I wonder?), missing dates. In other cases

This seems to be the question I need answered as well. Does anyone know the
answer?


#11Martijn van Oosterhout
kleptog@svana.org
In reply to: Gregory S. Williamson (#5)
Re: importing db as text files

On Wed, Aug 13, 2003 at 02:59:29PM -0700, Gregory S. Williamson wrote:

I tend to use perl to preprocess dumps (in our case from Informix). It tends to work much faster than shell scripts (although your mileage may vary). I have to fix missing numeric values (if the column allows nulls why can't the copy command accept an empty field, I wonder?), missing dates. In other cases is massages date formats, spatial data, etc. For example, a crude program below to clean DOQQ metadata:

[snip]

$t_production_date = "\\N"; # psql seems to dump a blank data with this nomenclature

Umm, \N represents NULL. NULL is not a blank field, it's null. You could
also tell copy that a blank field represents a null but that might have
unexpected effects on text fields which are supposed to be blank.
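
For example (a sketch; the table name is made up):

copy importtable from stdin with delimiter '\t' null '';

would load an empty field as NULL - fine for a numeric column, but a text column
that really was meant to hold an empty string would silently come in as NULL too.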

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


"All that is needed for the forces of evil to triumph is for enough good
men to do nothing." - Edmond Burke
"The penalty good people pay for not being interested in politics is to be
governed by people worse than themselves." - Plato

#12Jason Godden
jasongodden@optushome.com.au
In reply to: expect (#10)
Re: importing db as text files

On Thu, 14 Aug 2003 02:17 pm, expect wrote:

On Wed, 13 Aug 2003 14:59:29 -0700

"Gregory S. Williamson" <gsw@globexplorer.com> wrote:

have to fix missing numeric values (if the column allows nulls why can't
the copy command accept an empty field, I wonder?), missing dates. In
other cases

This seems to be the question I need answered as well. Does anyone know
the answer?

Copy can accept an empty field (representing null):

copy datatable from stdin delimiter '\t' null '';

by default pg expects '\N' to mean null but you can override it:

Description: copy data between files and tables
Syntax:
COPY table [ ( column [, ...] ) ]
FROM { 'filename' | stdin }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ] ]
COPY table [ ( column [, ...] ) ]
TO { 'filename' | stdout }
[ [ WITH ]
[ BINARY ]
[ OIDS ]
[ DELIMITER [ AS ] 'delimiter' ]
[ NULL [ AS ] 'null string' ] ]

Rgds,

Jason

#13Gregory S. Williamson
gsw@globexplorer.com
In reply to: Jason Godden (#12)
Re: importing db as text files

Two issues raised (other than my atrocious coding)

a) if defaults for a column are defined and copy is given a value of <> (implicit NULL) why not use the default ? [does an explicit \n work -- didn't try, I have to confess]

b) more generally, if copy finds a column that allows null and the data is a null (not explicitly defined as such, just no data) isn't that value valid, e.g. a NULL value ?

I kludged a simple way to make my data load, without really understanding why -- other than how "C" converts strings to integers ("atoi") -- a simple NULL can't be a NULL entry or at least a '0' if that is the default.

Not meant in ANY way as criticism, really, just trying to get a handle on this cool tool, and how it differs from ones I do know.

Apologies for any double postings I may have caused.

Greg W.


#14Jason Godden
jasongodden@optushome.com.au
In reply to: expect (#8)
Re: importing db as text files

On Thu, 14 Aug 2003 01:52 pm, you wrote:

On Thu, 14 Aug 2003 07:34:55 +1000

Jason Godden <jasongodden@optushome.com.au> wrote:

Hi expect,

Best way in my opinion is to use the copy table command. This way Pg
will actually 'massage' the data (string escapes and all) for you.

I guess we're of the same opinion. I did use the copy table command.
I believe the problem is that pg is unable to use the default value when a
value is not present.

PG is behaving correctly IMO:

create table data (data1 int4 not null, data2 int4,data3 int4 not null default
10);

insert into data values (2,null,default) - OK
insert into data values (null,2,default) - Fail not null data1
insert into data values (2,null) - Fail? missing field?
insert into data (data1,data2) values (2,null) - OK, data3 = default with
explicit field nomination

copy from... (essentially becomes - although it does something a bit different
behind the scenes):

insert into data (data1,data2,data3) values (x,y,z)

if data3 is specified not null default 10 and you have a line in your import
file which is translated thus:

2 2 \N - default null but you can nominate what that is
insert into data (data1,data2,data3) values (2,2,null);

this will fail because you are explicitly saying put null in a not null field.

So use an intermediary table without not null constraints with copy from...
then use a query like:

insert into realtable (data1, data2, data3)
select data1,
       coalesce(data2, 0),  -- put the intended default in explicitly; default isn't valid inside a select
       data3
from data
where data1 is not null;  -- because data1 has no option to be null or a default value etc...

Problem solved... I'd be curious as to how many people actually import their data
STRAIGHT into their production tables without integrity checking. Of course
if you massage and manage it externally, such as the method Greg uses, then
you're in business too - either way I believe the pg copy syntax is correct
and makes sense. PG copy CAN'T guess whether you intend null or intend to skip
that field so that the default pops in there - null and default are two very
different things; in fact null could be infinitely many different things...

Unless someone changes the insert behaviour in a later release you will
have to come up with a 'massaged' way (Greg/Perl, intermediary tables and
pl/pgsql functions). But then think about this:

insert into data values (1,2,null)

which by the proposed new null behaviour suddenly becomes:

insert into data values (1,2,default) (say default = 10);

but in reality the user simply mucked up on data entry, didn't actually mean
default and meant to insert 70 instead - they don't get a warning about it
and your data integrity is screwed. Not only that the db isn't paying strict
attention to the intended SQL syntax and the constraint management is moved
to the client - PG (and any other decent database - ie Oracle) is far above
the MySQL
cram-it-in-even-if-it-breaks-code-all-your-business-rules-in-your-client way
of doing things!

Default should be used in an explicit sense IMHO:

insert into data (data1,data2) values (1,2) - now data3 becomes default and
all is good

VERY different to: insert into data values (1,2,null);

#15Dave Cramer
pg@fastcrypt.com
In reply to: expect (#9)
query tuning

I have a query which definitely runs faster when sequential scans are
turned off. Which parameters do I tune? and which way?

After quickly perusing the docs, and google, I think it is the
random_page_cost?

Any help would be appreciated.

Dave
--
Dave Cramer <dave@fastcrypt.com>
fastcrypt

#16Dave Cramer
pg@fastcrypt.com
In reply to: expect (#9)
query tuning

I have a query which definitely runs faster when sequential scans are
turned off. Which parameters do I tune? and which way?

After quickly perusing the docs, and google, I think it is the
random_page_cost?

Any help would be appreciated.

Dave
--
Dave Cramer <dave@fastcrypt.com>
fastcrypt
--
Dave Cramer <Dave@micro-automation.net>

#17Bruno Wolff III
bruno@wolff.to
In reply to: expect (#6)
Re: importing db as text files

On Wed, Aug 13, 2003 at 20:45:55 -0700,
expect <expect@ihubbell.com> wrote:

On Wed, 13 Aug 2003 16:30:04 -0500
Bruno Wolff III <bruno@wolff.to> wrote:

On Wed, Aug 13, 2003 at 14:14:20 -0700,
expect <expect@ihubbell.com> wrote:

What's the big deal with importing text files? I have a 70 MB file to import
and it's been one problem after another. I used the copy command and it appears
that it's just not possible. I finally massaged the file into a .sql file and

That is unlikely, but without more details it is hard to say what you need
to do.

What's unlikely? That it didn't work? But it didn't work.

That it isn't possible to load your data using the copy command.
You may need to do some transformation before sending it to copy.
Using copy is probably going to give you the fastest load time.

From what I've read it should use the default value when there is no value.
Is that not true? Is this a known issue?

Where did you read that? What do you mean by 'no value'?

The syntax for the insert command is given in the manual:
INSERT INTO table [ ( column [, ...] ) ]
{ DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) | query }

No you've come to the wrong impression. I believe that the problem lies in the
fact that pg will not default to the default value when no value is present.

And how do you think 'no value' is represented? If you use default to
represent 'no value' then things will work as you expect for insert
statements. When you are using copy, defaults can't be used on a row
by row basis (but unlisted columns will get default values) and you will
need to include the value that you want in each row.
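
For example, with the table Jason posted earlier (just a sketch):

copy data (data1, data2) from stdin;
2	7
\.
-- data3 is not in the column list, so every loaded row gets its default (10)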

#18Bruno Wolff III
bruno@wolff.to
In reply to: Dave Cramer (#16)
Re: query tuning

Please don't reply to messages to start a new thread.

On Thu, Aug 14, 2003 at 08:09:23 -0400,
Dave Cramer <Dave@micro-automation.net> wrote:

I have a query which definitely runs faster when sequential scans are
turned off. Which parameters do I tune? and which way?

After quickly perusing the docs, and google, I think it is the
random_page_cost?

Have you run vacuum analyze?

Can you supply a copy of the query and explain analyze output?
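
Something along these lines, with your own table and query substituted in:

vacuum analyze mytable;
explain analyze select * from mytable where somecol = 42;
-- then compare plan and timing after discouraging sequential scans:
set enable_seqscan = off;
explain analyze select * from mytable where somecol = 42;
-- if the index plan really is faster, lowering random_page_cost
-- (e.g. set random_page_cost = 2;) is a gentler knob than disabling seqscans outright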

#19Bruno Wolff III
bruno@wolff.to
In reply to: Jason Godden (#12)
Re: importing db as text files

On Thu, Aug 14, 2003 at 14:50:41 +1000,
Jason Godden <jasongodden@optushome.com.au> wrote:

Copy can accept an empty field (representing null):

That assumes that an empty field is what he means by 'no value'.
For some data types an empty string is a perfectly valid value.

copy datatable from stdin delimiter '\t' null '';

by default pg expects '\N' to mean null but you can override it:

He wants the default value, not null though. Copy doesn't have a way
to specify a string to be replaced by the default value. It probably
wouldn't be too hard to add this option, but it is too late for 7.4.

If the default value is a constant preprocessing should be an easy way
to handle the issue. Another option may be to change nulls to the default
default after importing the data (assuming there aren't actual nulls
in the data). If he is trying to use nextval for all rows on one column,
then not entering that column would be the way to go.
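
For example (a sketch; the column name and default value are placeholders):

copy importtable from stdin with null '';
update importtable set amount = 0 where amount is null;  -- 0 standing in for the column's default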

We haven't seen any example of the data he is trying to import so it is
hard to give him specific advice.

#20Bruno Wolff III
bruno@wolff.to
In reply to: Gregory S. Williamson (#13)
Re: importing db as text files

On Thu, Aug 14, 2003 at 00:50:27 -0700,
"Gregory S. Williamson" <gsw@globexplorer.com> wrote:

Two issues raised (other than my atrocious coding)

a) if defaults for a column are defined and copy is given a value of <> (implicit NULL) why not use the default ? [does an explicit \n work -- didn't try, I have to confess]

Because empty strings are valid data and you can't go assuming they are
null, since you would then need a different string to represent the empty
string.

b) more generally, if copy finds a column that allows null and the data is a null (not explicitly defined as such, just no data) isn't that value valid, e.g. a NULL value ?

It might be bad data. The copy statement provides a way to define a string that
represents a null value.

It would probably be reasonable to add a way to specify a string to be
replaced by the default value.

#21expect
expect@ihubbell.com
In reply to: Bruno Wolff III (#17)
#22Bruno Wolff III
bruno@wolff.to
In reply to: expect (#21)
#23Murthy Kambhampaty
murthy.kambhampaty@goeci.com
In reply to: Bruno Wolff III (#22)
#24expect
expect@ihubbell.com
In reply to: Bruno Wolff III (#22)
#25Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: expect (#24)
#26expect
expect@ihubbell.com
In reply to: Stephan Szabo (#25)
#27Jason Godden
jasongodden@optushome.com.au
In reply to: expect (#26)
#28Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: expect (#26)
#29Franco Bruno Borghesi
franco@akyasociados.com.ar
In reply to: Dave Cramer (#15)
#30Murthy Kambhampaty
murthy.kambhampaty@goeci.com
In reply to: Stephan Szabo (#28)
#31expect
expect@ihubbell.com
In reply to: Stephan Szabo (#28)
#32expect
expect@ihubbell.com
In reply to: expect (#31)
#33Dennis Gearon
gearond@cvc.net
In reply to: Stephan Szabo (#28)
#34Murthy Kambhampaty
murthy.kambhampaty@goeci.com
In reply to: Dennis Gearon (#33)
#35Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: expect (#31)
#36Fernando Nasser
fnasser@redhat.com
In reply to: Dave Cramer (#15)
#37Gregory S. Williamson
gsw@globexplorer.com
In reply to: Stephan Szabo (#35)
#38scott.marlowe
scott.marlowe@ihs.com
In reply to: Dave Cramer (#15)