Desperately Seeking Regular Expression
Hi all -
I am porting a PROGRESS database to PostgreSQL.
I've had success previously doing a port - but from FoxPro which
allows one to dump data delimited by tabs. Unfortunately, PROGRESS
dumps fields delimited by whitespace rather than tabs and I can find no
documentation on how to alter this behaviour.
I read the recent post wherein someone used awk to change whitespace
to tabs:
cat $input | awk '{ print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" \
$6"\t"$7"\t" }' > $input.out
I am using this with good effect. However, I run into trouble as
inside my dump file(s) there are doublequoted character strings.
awk is changing the whitespace delimited words inside the char strs
into tab delimited words inside strings. Ouch.
What follows is my inept effort to get sed on my side as I try to sort
this out:
sed -e 's/" *"/ /g' $input.out > $input.sql
This is a miserable failure as it simply converts all the tabs back
to whitespace. I've tried escaping the double quotes in the regex
but then sed changes nothing.
Can someone put me out of my misery? Anyone have suggestions on a
regular expression that will:
Convert tabs to whitespace *inside* of double quoted strings *only* ???
I don't care if the regex is for sed/awk/perl, whatever, I need to get
the job done!
TIA!
Stuck in Staten Island,
Tom
----
North Richmond Community Mental Health Center
---
Thomas Good tomg@ { admin | q8 } .nrnet.org
Phone: 718-354-5528
Fax: 718-354-5056
Powered By: Slackware 3.6 PostgreSQL 6.3.2
---
/* Die Wahrheit Ist Irgendwo Da Draussen... */
Import Notes
Reply to msg id not found: Pine.LNX.4.04.9904270843190.5314-100000@cumquat.fruit.com
I solved something like this recently in perl. It's not terribly
efficient, but it is simple. I'm doing this from memory, so it may need
some debugging. Use something along the lines of
#!/usr/local/bin/perl
while (<>) {
@a = split /(\")/;
# This gives you a list with some of the items being double-quotes
# Now you need to figure out which ones were inside double quotes
my $b;
if ($a[0] eq '"') {
# we started with a double quoted string, so join th e 1st 3 fields
# and stick them on the new string
$b = join('',splice(@a,0,3))."\t";
}
while (@a) {
$b .= join("\t",split(' ',shift @a))."\t";
# if there is more then we have another double quoted string
$b = join('',splice(@a,0,3))."\t" if @a;
}
# Remove the last tab and replace with a newline
$b =~ s/\t$/\n/;
print $b;
}
Adriaan
First ahve postgres "copy" the file using tabs as the delimiter then this is
what I used to convert a "copy into" file to foxbase importable format...
s/([^\t]*?)\t([^\t]*?)\t([+-]*\d*?)\t([+-]*\d*?)\t([^\t]*?)\t([^\t]*?)\t([^\
t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^
\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([^\t]*?)\t([
^\t]*?)\t([^\t]*?)\t([^\t]*?)\t(.*)/"$1"\t"$2"\t$3\t$4\t"$5"\t"$6"\t"$7"\t"$
8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16"\t"$17"\t"$18"\t"$1
9"\t"$20"\t"$21"\t"$22"/;
As you can see the text file has 22 fields in it with some of them being
numeric and others being text (the quoted ones).
If you don't understand some of the regex expressions let me know....
-----Original Message-----
From: Thomas Good <tomg@q8.nrnet.org>
To: pgsql-novice@postgreSQL.org <pgsql-novice@postgreSQL.org>
Cc: pgsql-general@postgreSQL.org <pgsql-general@postgreSQL.org>
Date: Tuesday, April 27, 1999 8:31 AM
Subject: [GENERAL] Desperately Seeking Regular Expression
Show quoted text
Hi all -
I am porting a PROGRESS database to PostgreSQL.
I've had success previously doing a port - but from FoxPro which
allows one to dump data delimited by tabs. Unfortunately, PROGRESS
dumps fields delimited by whitespace rather than tabs and I can find no
documentation on how to alter this behaviour.I read the recent post wherein someone used awk to change whitespace
to tabs:cat $input | awk '{ print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" \
$6"\t"$7"\t" }' > $input.outI am using this with good effect. However, I run into trouble as
inside my dump file(s) there are doublequoted character strings.
awk is changing the whitespace delimited words inside the char strs
into tab delimited words inside strings. Ouch.What follows is my inept effort to get sed on my side as I try to sort
this out:
sed -e 's/" *"/ /g' $input.out > $input.sqlThis is a miserable failure as it simply converts all the tabs back
to whitespace. I've tried escaping the double quotes in the regex
but then sed changes nothing.Can someone put me out of my misery? Anyone have suggestions on a
regular expression that will:Convert tabs to whitespace *inside* of double quoted strings *only* ???
I don't care if the regex is for sed/awk/perl, whatever, I need to get
the job done!TIA!
Stuck in Staten Island,
Tom
----
North Richmond Community Mental Health Center
---
Thomas Good tomg@ { admin | q8 } .nrnet.org
Phone: 718-354-5528
Fax: 718-354-5056
Powered By: Slackware 3.6 PostgreSQL 6.3.2
---
/* Die Wahrheit Ist Irgendwo Da Draussen... */
Import Notes
Resolved by subject fallback
At 16:35 +0300 on 27/04/1999, Thomas Good wrote:
I've had success previously doing a port - but from FoxPro which
allows one to dump data delimited by tabs. Unfortunately, PROGRESS
dumps fields delimited by whitespace rather than tabs and I can find no
documentation on how to alter this behaviour.I read the recent post wherein someone used awk to change whitespace
to tabs:cat $input | awk '{ print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" \
$6"\t"$7"\t" }' > $input.outI am using this with good effect. However, I run into trouble as
inside my dump file(s) there are doublequoted character strings.
awk is changing the whitespace delimited words inside the char strs
into tab delimited words inside strings. Ouch.
I have a feeling that you are missing additional points. For example, if
you want to use the resulting text as input for COPY, strings should not be
delimited within quotes. And possible tabs and newlines and backslashes
within the file should be properly preceded with "\".
And what do you mean by the fact that the output is delimited by
whitespaces? That there is a single whitespace between the fields? Because
I think the awk above would join null fields in such a case. Or did you
mean that it outputs a fixed width file? That is, that the first field is
from column 1 to column 20, and if it's shorter, it adds spaces until
column 20?
That would require a different treatment.
If you give a more detailed description, you may get a better solution.
Herouth
--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma
On Thu, 29 Apr 1999, Herouth Maoz wrote:
At 16:35 +0300 on 27/04/1999, Thomas Good wrote:
I've had success previously doing a port - but from FoxPro which
allows one to dump data delimited by tabs. Unfortunately, PROGRESS
dumps fields delimited by whitespace rather than tabs and I can find no
documentation on how to alter this behaviour.I read the recent post wherein someone used awk to change whitespace
to tabs:cat $input | awk '{ print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t" \
$6"\t"$7"\t" }' > $input.outI am using this with good effect. However, I run into trouble as
inside my dump file(s) there are doublequoted character strings.
awk is changing the whitespace delimited words inside the char strs
into tab delimited words inside strings. Ouch.
I have a feeling that you are missing additional points. For example, if
you want to use the resulting text as input for COPY, strings should not be
delimited within quotes. And possible tabs and newlines and backslashes
within the file should be properly preceded with "\".
Hi Herouth! How goes it? I'm assuming you got Solaris and PG happy.
I'm wrestling with UnixWare now - just got the backend to compile, now
fighting with ecpg...(Bruce M is probably rather weary of me by now!)
Anyway -
Thanks for jumping into the breach...what I do is this:
0) use awk to create tabs where whitespace exists as a field separator.
1) use perl to tr tabs back to whitespace within double quoted strings.
2) use sed to change "" to \N (PROGRESS nulls are idiosyncratic/idiotic)
3) use sed to change ? (the PROGRESS unknown value) to \N
4) use sed to strip the remaining single quotes
And that is essentially it. The dump file loads correctly.
MANY thanks to Oliver E for sending me a perl script that
completes step two...(this is a very elegant solution).
Thanks also to Adriaan Joubert who also sent me some code.
(Merci!) Here is Olly's script (I call this from an awk
script then redirect the output - crude but effective):
#!/usr/bin/perl
$input = $ARGV[0];
open (DUMPFILE, "$input.out");
while (<DUMPFILE>) {
if ($_ =~ '"') {
@ln = split /"/, $_;
$i = 0;
foreach $elem (@ln) {
if ( $i % 2) {
print '"';
$elem =~ tr/\t/ /;
print $elem;
print '"';
} else {
print $elem;
}
$i++;
}
} else {
print $_;
}
}
close DUMPFILE;
It is a relief to have:
1) my dump files ready for loading into PG
2) the UnixWare PG binaries almost ready for primetime.
When this is finished we will have moved my entire shop (the Dept of
Psychiatry - we span two hospital campuses) from PROGRESS on UnixWare
and FoxPro on DOS to PG on Linux/FreeBSD. PG on UnixWare is an
interim step. Once the database is ported I am going to commence working
on losing UnixWare...
Nice talking to you - be well!
Tom
----
North Richmond Community Mental Health Center
---
Thomas Good tomg@ { admin | q8 } .nrnet.org
Phone: 718-354-5528
Fax: 718-354-5056
Powered By: Slackware 3.6 PostgreSQL 6.3.2
---
/* Die Wahrheit Ist Irgendwo Da Draussen... */
Import Notes
Resolved by subject fallback
At 15:31 +0300 on 29/04/1999, Thomas Good wrote:
0) use awk to create tabs where whitespace exists as a field separator.
1) use perl to tr tabs back to whitespace within double quoted strings.
2) use sed to change "" to \N (PROGRESS nulls are idiosyncratic/idiotic)
3) use sed to change ? (the PROGRESS unknown value) to \N
4) use sed to strip the remaining single quotes
Why not do them all in perl? Why run through 4 separate steps if you are
already going through perl?
And did you mean remaining single quotes (') or remaining double quotes (")?
You can first unify steps 0 and 1. Your perl program splits by
double-quotes, and processes every second element (which indicates it's
inside the quotes). You should use the same principle, but process only the
even-numbered strings, changing spaces to tabs, not the other way around.
foreach $elem (@ln) {
if ( $i % 2) {
print '"';
$elem =~ tr/\t/ /;
print $elem;
print '"';
} else {
print $elem;
}
$i++;
Instead:
foreach $elem (@ln) {
if ( $i % 2) {
print '"'.$elem.'"';
} else {
$elem =~ tr/ /\t/;
print $elem;
}
$i++;
(May need modifications, especially in the initialization part, but you get
the idea).
Adding the other modifications (assuming you meant remaining double
quotes), you get something along the lines of:
foreach $elem (@ln) {
if ( $i % 2) {
if ( $elem eq "" ) {
print "\\N";
} else {
print $elem;
}
} else {
$elem =~ tr/ /\t/;
$elem =~ s/?/\\N/;
print $elem;
}
$i++;
Herouth
--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma