Using COPY to import a large XML file
Hello Everyone,
I have downloaded the Stack Overflow posts XML (it contains all SO questions
to date); the file is around 70 GB. I want to import the data in that XML
into my table. Is there a way to do so in Postgres?
Thanks,
Anto.
On 06/24/2018 05:25 PM, Anto Aravinth wrote:
Hello Anto,
I used this tool:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres
Regards,
--
Adrien NAYRAT
https://blog.anayrat.info
Thanks for the response. I'm not sure how long this tool takes for the
70 GB of data.
I used Node to stream the XML file into INSERTs, which was very slow. The
XML contains 40 million records, of which 10 million took around 2 hours
using Node.js. Hence I thought I would use the COPY command, as suggested
on the internet.
I will definitely try the code and let you know. But it looks like it uses
the same INSERT, not COPY; it will be interesting if it runs quickly on my
machine.
On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info>
wrote:
On 06/24/2018 06:07 PM, Anto Aravinth wrote:
In my memory, it took several hours; I can't remember whether the XML
conversion or the inserts took longer.
Yes, it uses INSERT. It may not be difficult to change the code to use
COPY instead.
--
Adrien NAYRAT
https://blog.anayrat.info
On 06/24/2018 08:25 AM, Anto Aravinth wrote:
It is going to require some work. You will need to deal with:
1) The row schema inside the XML, described here:
https://ia800107.us.archive.org/27/items/stackexchange/readme.txt
under **posts**.xml.
2) The rows are inside a <posts> tag.
Seems to me you have two options:
1) Drop each row into a single XML field and deal with extracting the
row components in the database.
2) Break down the row into column components before entering them into
the database.
Adrien has pointed you at a Python program that covers the above:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres
If you are comfortable in Python you can take a look at:
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/row_processor.py
to see how the rows are broken down into elements.
I would try this out first on one of the smaller datasets found here:
https://archive.org/details/stackexchange
I personally took a look at:
https://archive.org/download/stackexchange/beer.stackexchange.com.7z
because why not?
--
Adrian Klaver
adrian.klaver@aklaver.com
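Adrian's option 2 - breaking each row down into column components before it reaches the database - might be sketched with Python's standard-library streaming parser. The attribute names below (Id, Title, Body, ParentId) are assumed from the readme he links; verify them against your dump.

```python
# Stream <row> elements out of a Stack Exchange Posts.xml without loading
# the whole 70 GB file into memory. Attribute names (Id, Title, Body,
# ParentId) are assumed from the readme; check your dump.
import io
import xml.etree.ElementTree as ET

def iter_rows(xml_source):
    """Yield one dict of attributes per <row>, freeing each element as we go."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # release the parsed element so memory stays bounded

# Tiny stand-in for Posts.xml:
sample = io.StringIO(
    '<posts>'
    '<row Id="1" PostTypeId="1" Title="First question" Body="some text"/>'
    '<row Id="2" PostTypeId="2" ParentId="1" Body="an answer"/>'
    '</posts>'
)
rows = list(iter_rows(sample))
```

Each yielded dict can then become one line of COPY input or one CSV record.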
## Adrien Nayrat (adrien.nayrat@anayrat.info):
I used this tool :
https://github.com/Networks-Learning/stackexchange-dump-to-postgres
That will be awfully slow: this tool commits each INSERT on its own;
see the loop in
https://github.com/Networks-Learning/stackexchange-dump-to-postgres/blob/master/load_into_pg.py#L83
With only small changes - prepare the INSERT, execute it for all (or at
least a huge lot of) rows, COMMIT at the end - you can save quite a lot
of time (500 rows is not "a huge lot"). And when you do that, for
heaven's sake, do not try to create the INSERT statement as a string
with the values - Bobby Tables will eat you. See psycopg documentation
on how it's done (especially watch the warnings):
http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries
On prepared statements with psycopg2, see
http://initd.org/psycopg/articles/2012/10/01/prepared-statements-psycopg/
(python makes stuff rather hard, but I'll spare you the snark and wait until
the language has matured to at least version 5).
Using the COPY protocol with psycopg2 seems to require some hoop-jumping,
but could improve matters even more.
Regards,
Christoph
--
Spare Space.
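Christoph's recipe - one parameterized INSERT, many rows per transaction, parameters passed separately from the SQL string - might be sketched like this. The table and column names are made up for illustration, and the cursor is left to the caller since real use needs a live server; only the batching helper runs standalone.

```python
# Sketch of batched, parameterized inserts (hypothetical table/columns).
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Parameters are passed separately from the SQL - never interpolated into
# the string - so the driver handles quoting (no Bobby Tables).
INSERT_SQL = "INSERT INTO posts (id, title, body) VALUES (%s, %s, %s)"

def load(cur, rows, batch_size=10000):
    for chunk in batched(rows, batch_size):
        cur.executemany(INSERT_SQL, chunk)
    # caller commits once at the end: conn.commit()
```

With psycopg2 the same pattern is usually written with a real cursor and a single commit after the loop.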
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
If you are using Node.js, then you can easily use the pg-copy-streams
module to insert the records into your database. I've been using this
for inserting large numbers of records from NetCDF files. It takes between
40 and 50 minutes to insert 60 million+ records, and we are doing
additional calculations on the values, not just inserting them; plus we
are inserting over the network, into a database which is also performing
other processing.
We found a significant speed improvement with COPY over blocks of insert
transactions, which was faster than just individual inserts. The only
downside with using COPY is that it either completely works or
completely fails and when it fails, it can be tricky to work out which
record is causing the failure. A benefit of using blocks of transactions
is that you have finer-grained control, allowing you to recover from
some errors or provide more specific detail regarding the cause of the
error.
Be wary of what indexes you're defining on your table. Depending on the
type and number, these can have significant impact on insert times as
well.
--
Tim Cross
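Tim's trade-off - COPY is all-or-nothing, while smaller transactions make failures easier to localize - suggests a middle ground: load in chunks and, when a chunk is rejected, bisect it. A language-agnostic sketch in Python; `copy_chunk` is a hypothetical callback that would run COPY FROM STDIN in real code and here simply raises on bad input.

```python
# Find a record a COPY-style all-or-nothing loader rejects by bisecting
# the failing chunk. `copy_chunk` is a hypothetical callback that raises
# on bad input (a real one would run COPY FROM STDIN on the records).
def find_bad_record(records, copy_chunk):
    """Return the index of a record copy_chunk rejects, or None."""
    try:
        copy_chunk(records)
        return None  # whole chunk loads fine
    except Exception:
        if len(records) == 1:
            return 0
        mid = len(records) // 2
        left = find_bad_record(records[:mid], copy_chunk)
        if left is not None:
            return left
        right = find_bad_record(records[mid:], copy_chunk)
        return None if right is None else mid + right
```

Note this re-tries the passing half as well, so the worst case costs O(n log n) load attempts; in practice you would only bisect on failure and commit the chunks that pass.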
On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx@gmail.com> wrote:
Sure, let me try that. I have a question here: COPY usually works when you
move data from files to your Postgres instance, right? Now, in Node.js,
while processing the whole file, can I use COPY programmatically, like
COPY Stackoverflow <calculated value at run time>? Because from the doc:
https://www.postgresql.org/docs/9.2/static/sql-copy.html
I don't see that it's possible. Maybe I need to convert the files into
something COPY understands first?
Anto.
On Mon, 25 Jun 2018 at 11:38, Anto Aravinth <anto.aravinth.cse@gmail.com>
wrote:
Yes. Essentially what you do is create a stream and feed whatever
information you want to copy into that stream. PG sees the data as if it
were reading each line in a file, so you push data onto the stream where
each item is separated by a tab (or whatever). Here is the basic low-level
function I use (don't know how the formatting will go!):
async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function", "The stringifyFN arg must be a function");
  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;

        function done() {
          releaseClient(client);
          client = undefined;
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable();
        rs._read = arrayRead;
        rs.on("error", onError);
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);
        rs.pipe(stream);
      });
    })
    .catch(err => {
      throw new VError(err, `${logName} Failed COPY insert`);
    });
}
and I will call it like:

copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are:

const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
  + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
  + "vprp_09,vprp_15,wind_speed) FROM STDIN";

function stringifyClimateRecord(rec) {
  return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
    + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
}
stringifyClimateRecord returns a record to be inserted as a 'line' into
the stream, with values separated by tabs. records is an array of data
records, where each record is itself an array.
--
regards,
Tim
## Anto Aravinth (anto.aravinth.cse@gmail.com):
Sure, let me try that. I have a question here: COPY usually works when you
move data from files to your Postgres instance, right? Now, in Node.js,
while processing the whole file, can I use COPY programmatically, like
COPY Stackoverflow <calculated value at run time>? Because from the doc:
https://www.postgresql.org/docs/9.2/static/sql-copy.html
I don't see that it's possible. Maybe I need to convert the files into
something COPY understands first?
"COPY ... FROM STDIN"
STDIN Specifies that input comes from the client application.
It's on the page...
Regards,
Christoph
--
Spare Space.
Thanks a lot. But I did hit a lot of challenges! It looks like the SO data
contains a lot of tabs within itself, so the tab delimiter didn't work for
me. I thought I could give a special delimiter, but it looks like the
PostgreSQL COPY allows only one character as the delimiter :(
Sad; I guess the only way is to INSERT, or to do a thorough serialization
of my data into something that COPY can understand.
On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx@gmail.com> wrote:
2018-06-25 16:25 GMT+02:00 Anto Aravinth <anto.aravinth.cse@gmail.com>:
The easiest way would be:
xml -> csv -> \copy
By CSV, I mean regular quoted CSV: simply wrap each CSV field in double
quotes, and escape any contained quotes with another double quote.
PostgreSQL's COPY CSV parser is one of the most robust I have ever tested.
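Nicolas's xml -> csv -> \copy recipe is easy to get right with Python's csv module, which applies exactly the quoting he describes (fields wrapped in double quotes, embedded quotes doubled), so tabs and newlines inside post bodies survive. The columns here are hypothetical stand-ins for the so2 table.

```python
# Write rows with awkward content (tabs, newlines, quotes) as quoted CSV
# that COPY's csv mode can parse. Columns are illustrative only.
import csv
import io

rows = [
    (1, 'Title with "quotes"', "body with\ttab and\nnewline", None),
    (2, "Plain title", "plain body", 1),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # default QUOTE_MINIMAL doubles embedded quotes

# Round-trip check: the reader recovers the awkward fields intact.
back = list(csv.reader(io.StringIO(buf.getvalue())))
```

The resulting file can then be loaded with something like `\copy so2 FROM 'file.csv' WITH (FORMAT csv)`; in CSV mode an unquoted empty field is treated as NULL by default, which is how the None above would come across.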
On Mon, Jun 25, 2018 at 8:20 PM, Nicolas Paris <niparisco@gmail.com> wrote:
I tried, but no luck. Here is a sample of the CSV I wrote from my XML converter:
1 "Are questions about animations or comics inspired by Japanese
culture or styles considered on-topic?" "pExamples include a href=""
http://www.imdb.com/title/tt0417299/"" rel=""nofollow""Avatar/a, a href=""
http://www.imdb.com/title/tt1695360/"" rel=""nofollow""Korra/a and, to some
extent, a href=""http://www.imdb.com/title/tt0278238/""
rel=""nofollow""Samurai Jack/a. They're all widely popular American
cartoons, sometimes even referred to as ema href=""
https://en.wikipedia.org/wiki/Anime-influenced_animation""
rel=""nofollow""Amerime/a/em./p
pAre questions about these series on-topic?/p
" "pExamples include a href=""http://www.imdb.com/title/tt0417299/""
rel=""nofollow""Avatar/a, a href=""http://www.imdb.com/title/tt1695360/""
rel=""nofollow""Korra/a and, to some extent, a href=""
http://www.imdb.com/title/tt0278238/"" rel=""nofollow""Samurai Jack/a.
They're all widely popular American cartoons, sometimes even referred to as
ema href=""https://en.wikipedia.org/wiki/Anime-influenced_animation""
rel=""nofollow""Amerime/a/em./p
pAre questions about these series on-topic?/p
" "null"
the schema of my table is:
CREATE TABLE so2 (
id INTEGER NOT NULL PRIMARY KEY,
title varchar(1000) NULL,
posts text,
body TSVECTOR,
parent_id INTEGER NULL,
FOREIGN KEY (parent_id) REFERENCES so1(id)
);
and when I run:
COPY so2 from '/Users/user/programs/js/node-mbox/file.csv';
I get:
CONTEXT: COPY so2, line 1: "1 "Are questions about animations or comics
inspired by Japanese culture or styles considered on-top..."
I'm not sure what I'm missing. Maybe the above CSV is breaking because I
have newlines within my content, but the error message is very hard to
debug.
On Mon, Jun 25, 2018 at 8:54 PM, Anto Aravinth <anto.aravinth.cse@gmail.com>
wrote:
I get:
ERROR: missing data for column "body"
CONTEXT: COPY so2, line 1: "1 "Are questions about animations or comics
inspired by Japanese culture or styles considered on-top..."
2018-06-25 17:30 GMT+02:00 Anto Aravinth <anto.aravinth.cse@gmail.com>:
What you are missing is the configuration of the COPY statement (please
refer to https://www.postgresql.org/docs/9.2/static/sql-copy.html),
namely the format, delimiter, quote, and escape options.
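A "missing data for column" error usually means COPY parsed fewer fields than the table has columns. Parsing the file with the same CSV rules before loading narrows it down; here EXPECTED is 5, matching the so2 table (id, title, posts, body, parent_id).

```python
# Pre-flight check: parse the CSV the way COPY's csv mode would, and
# report each record whose field count doesn't match the table width.
import csv
import io

EXPECTED = 5  # so2: id, title, posts, body, parent_id

def bad_records(csv_text):
    """Return (record_number, field_count) for rows that won't fit."""
    problems = []
    for n, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != EXPECTED:
            problems.append((n, len(row)))
    return problems

sample = '1,"a title","a body","tsv",\n2,"only","three fields"\n'
```

Note that COPY reports physical line numbers while csv.reader counts records (a quoted field may span several lines), so treat the record number as approximate.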
On 06/25/2018 07:25 AM, Anto Aravinth wrote:
I use | as it is rarely found in data itself.
Sad, I guess only way is to insert or do a through serialization of my
data into something that COPY can understand.On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx@gmail.com
<mailto:theophilusx@gmail.com>> wrote:On Mon, 25 Jun 2018 at 11:38, Anto Aravinth
<anto.aravinth.cse@gmail.com <mailto:anto.aravinth.cse@gmail.com>>
wrote:On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross
<theophilusx@gmail.com <mailto:theophilusx@gmail.com>> wrote:Anto Aravinth <anto.aravinth.cse@gmail.com
<mailto:anto.aravinth.cse@gmail.com>> writes:Thanks for the response. I'm not sure, how long does this tool takes for
the 70GB data.I used node to stream the xml files into inserts.. which was very slow..
Actually the xml contains 40 million records, out of which 10Million took
around 2 hrs using nodejs. Hence, I thought will use COPY command, as
suggested on the internet.Definitely, will try the code and let you know.. But looks like it uses the
same INSERT, not copy.. interesting if it runs quick on my machine.On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nayrat@anayrat.info <mailto:adrien.nayrat@anayrat.info>>
wrote:On 06/24/2018 05:25 PM, Anto Aravinth wrote:
Hello Everyone,
I have downloaded the Stackoverflow posts xml (contains all SO questions
till
date).. the file is around 70GB.. I wanna import the data in those xml
to my
table.. is there a way to do so in postgres?
Thanks,
Anto.Hello Anto,
I used this tool :
https://github.com/Networks-Learning/stackexchange-dump-to-postgres<https://github.com/Networks-Learning/stackexchange-dump-to-postgres>
If you are using nodejs, then you can easily use the pg-copy-streams module to insert the records into your database. I've been using this for inserting large numbers of records from NetCDF files. It takes between 40 and 50 minutes to insert 60 million+ records, and we are doing additional calculations on the values, not just inserting them, plus we are inserting over the network into a database which is also performing other processing.

We found a significant speed improvement with COPY over blocks of insert transactions, which in turn was faster than individual inserts. The only downside with using COPY is that it either completely works or completely fails, and when it fails, it can be tricky to work out which record is causing the failure. A benefit of using blocks of transactions is that you have more fine-grained control, allowing you to recover from some errors or provide more specific detail regarding the cause of the error.

Sure, let me try that.. I have a question here: COPY usually works when you move data from files to your postgres instance, right? Now in node.js, processing the whole file, can I use COPY programmatically, like COPY Stackoverflow <calculated value at run time>? Because from the doc:

https://www.postgresql.org/docs/9.2/static/sql-copy.html

I don't see that it's possible. Maybe I need to convert the files to something COPY understands first?

Anto.
Yes. Essentially what you do is create a stream and feed whatever information you want to copy into that stream. PG sees the data as if it was seeing each line in a file, so you push data onto the stream where each item is separated by a tab (or whatever). Here is the basic low-level function I use (don't know how the formatting will go!):

    const assert = require("assert");
    const { Readable } = require("stream");
    const VError = require("verror");
    const copyFrom = require("pg-copy-streams").from;
    // getClient/releaseClient are local pool helpers; moduleName is a local constant.

    async function copyInsert(sql, stringifyFN, records) {
      const logName = `${moduleName}.copyInsert`;
      var client;

      assert.ok(Array.isArray(records), "The records arg must be an array");
      assert.ok(typeof(stringifyFN) === "function", "The stringifyFN arg must be a function");
      return getClient()
        .then(c => {
          client = c;
          return new Promise(function(resolve, reject) {
            var stream, rs;
            var idx = 0;

            function done() {
              releaseClient(client);
              client = undefined;
              resolve(idx + 1);
            }

            function onError(err) {
              if (client !== undefined) {
                releaseClient(client);
              }
              reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
            }

            function arrayRead() {
              if (idx === records.length) {
                rs.push(null);
              } else {
                let rec = records[idx];
                rs.push(stringifyFN(rec));
                idx += 1;
              }
            }

            rs = new Readable();
            rs._read = arrayRead;
            rs.on("error", onError);
            stream = client.query(copyFrom(sql));
            stream.on("error", onError);
            stream.on("end", done);
            rs.pipe(stream);
          });
        })
        .catch(err => {
          throw new VError(err, `${logName} Failed COPY insert`);
        });
    }

and I will call it like

    copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are

    const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
      + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
      + "vprp_09,vprp_15,wind_speed) FROM STDIN";

    function stringifyClimateRecord(rec) {
      return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
        + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
    }

The stringifyClimateRecord function returns a record to be inserted as a 'line' into the stream, with values separated by tabs. records is an array of data records where each record is an array.

--
regards,

Tim
--
Tim Cross
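(For the "blocks of insert transactions" approach Tim contrasts with COPY above, a minimal sketch is below. The helper names chunkRecords and buildInsertSql are hypothetical, not from the thread.)

```javascript
// Split records into batches, each of which would be inserted in its own transaction.
function chunkRecords(records, batchSize) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}

// Build a single multi-row parameterized INSERT statement for one batch.
function buildInsertSql(table, columns, rowCount) {
  const rows = [];
  let p = 1;
  for (let r = 0; r < rowCount; r += 1) {
    rows.push("(" + columns.map(() => "$" + p++).join(",") + ")");
  }
  return `INSERT INTO ${table} (${columns.join(",")}) VALUES ${rows.join(",")}`;
}
```

Each batch would then be sent inside its own BEGIN/COMMIT, so a failure loses only that batch and the offending block is easy to identify, which is the fine-grained control mentioned above.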
--
Adrian Klaver
adrian.klaver@aklaver.com
Anto Aravinth <anto.aravinth.cse@gmail.com> writes:

Thanks a lot. But I do have a lot of challenges! Looks like SO data contains a lot of tabs within itself, so the tab delimiter didn't work for me. I thought I could give a special delimiter, but it looks like PostgreSQL COPY allows only one character as the delimiter :(

Sad, I guess the only way is to insert, or do a thorough serialization of my data into something that COPY can understand.
The COPY command has a number of options, including setting what is used
as the delimiter - it doesn't have to be tab. You need to also look at
the logs/output to see exactly why the copy fails.
I'd recommend first pre-processing your input data to make sure it is
'clean' and all the fields actually match with whatever DDL you have
used to define your db tables etc. I'd then select a small subset and
try different parameters to the copy command until you get the right
combination of data format and copy definition.
It may take some effort to get the right combination, but the result is
probably worth it given your data set size i.e. difference between hours
and days.
--
Tim Cross
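(On the embedded-tab problem: rather than changing the delimiter, another option, sketched here and not taken from the thread, is to escape values using COPY's text-format escape sequences, where backslash, tab, newline and carriage return are written as \\, \t, \n and \r, and NULL as \N. The helper names are hypothetical.)

```javascript
// Escape one value for PostgreSQL COPY text format (tab-delimited).
function escapeCopyValue(value) {
  if (value === null || value === undefined) return "\\N"; // COPY's default NULL marker
  return String(value)
    .replace(/\\/g, "\\\\") // backslash first, so added escapes aren't re-escaped
    .replace(/\t/g, "\\t")
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// Join one record into a single COPY input line.
function makeCopyLine(fields) {
  return fields.map(escapeCopyValue).join("\t") + "\n";
}
```

A stringifyFN like the one earlier in the thread could call makeCopyLine so that post bodies containing literal tabs or newlines survive the round trip.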
Thanks a lot everyone. After playing around with a small dataset, I was able to make datasets that are easy to use with COPY. Creating the datasets of around 50GB took say 2 hrs (I can definitely improve on this).
For 54M records, COPY took around 35 minutes! Awesome.. :) :)
Meantime, I understood a few things like vacuum etc.
Really loving postgres!
Thanks,
Anto.
On Tue, Jun 26, 2018 at 3:40 AM, Tim Cross <theophilusx@gmail.com> wrote:
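(In line with Tim's point that COPY has a number of format options, COPY's CSV mode, i.e. COPY ... FROM STDIN WITH (FORMAT csv), handles embedded tabs and newlines by quoting instead of escaping. A minimal sketch with hypothetical helper names of CSV-quoting records on the node side:)

```javascript
// Quote a single field per the CSV rules COPY's csv format expects:
// wrap in double quotes when the value contains the delimiter, a quote,
// or a line break, and double any embedded quotes.
function toCsvField(value) {
  if (value === null || value === undefined) return ""; // unquoted empty string is NULL in csv mode
  const s = String(value);
  return /[",\n\r]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Join one record into a single CSV line for COPY's csv format.
function toCsvLine(fields) {
  return fields.map(toCsvField).join(",") + "\n";
}
```

Lines built this way can be pushed into the same Readable/pg-copy-streams pipeline shown earlier, with the COPY statement ending in FROM STDIN WITH (FORMAT csv).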