large xml database

Started by Viktor Bojović over 15 years ago · 8 messages · general
#1 Viktor Bojović
viktor.bojovic@gmail.com

Hi,
I have a very big XML document, larger than 50GB, and I want to import
it into a database and transform it into a relational schema.
When I split this document into smaller independent XML documents, I get
~11.1 million XML documents.
I have spent a lot of time trying to find the fastest way to transform all this
data, but every time I give up because it takes too much time. Sometimes it
would take more than a month if not stopped.
I have tried inserting each line as varchar into the database and parsing it
with plperl regexes.
I have also tried storing every document as XML and parsing it, but that is
also too slow.
I have tried storing every document as varchar, but that is also slow when
using regexes to extract the data.

Many attempts have failed because 8GB of RAM and 10GB of swap were not enough.
Also, sometimes I get that more than 2^32 operations were performed, and the
functions stopped working.

I just wanted to ask whether someone knows how to speed this up.

Thanks in advance
--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me
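
For illustration, row-at-a-time INSERTs are usually the first bottleneck at this data volume; PostgreSQL's COPY command streams many rows in a single operation instead. A minimal sketch using psycopg2, assuming the rows have already been extracted into a tab-separated file (the file, table, and columns are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
    cur = conn.cursor()

    # COPY streams the whole file in one command instead of one INSERT per row.
    with open("extracted_rows.tsv") as f:    # hypothetical pre-extracted, tab-separated rows
        cur.copy_expert("COPY entries (name, seq) FROM STDIN", f)

    conn.commit()
    conn.close()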

#2 Mike Christensen
mike@kitchenpc.com
In reply to: Viktor Bojović (#1)
Re: large xml database

Geeeeeeeez.

Maybe you can lease a bunch of Amazon EC2 high computing slices and
parallelize it? I think throwing ridiculous amounts of hardware at
things is always the best approach.
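
The same idea can be tried without EC2: the ~11.1 million split documents are independent of one another, so several local worker processes can parse them in parallel. A rough sketch, assuming a directory of split files and a hypothetical parse_one() worker:

    from multiprocessing import Pool
    from pathlib import Path

    def parse_one(path):
        # Hypothetical worker: parse one small XML document and return extracted rows.
        # The real parsing (and the INSERT/COPY of the result) would go here.
        return []

    if __name__ == "__main__":
        paths = list(Path("split_docs").glob("*.xml"))   # hypothetical directory of split documents
        with Pool(processes=8) as pool:                  # roughly one worker per core
            for rows in pool.imap_unordered(parse_one, paths, chunksize=1000):
                pass                                     # collect rows here, e.g. feed them to COPY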


#3 Andy Colson
andy@squeakycode.net
In reply to: Viktor Bojović (#1)
Re: large xml database

On 10/30/2010 4:48 PM, Viktor Bojović wrote:

Hi,
I have a very big XML document, larger than 50GB, and I want to
import it into a database and transform it into a relational schema.
When I split this document into smaller independent XML documents, I
get ~11.1 million XML documents.
I have spent a lot of time trying to find the fastest way to transform all
this data, but every time I give up because it takes too much time.
Sometimes it would take more than a month if not stopped.
I have tried inserting each line as varchar into the database and parsing
it with plperl regexes.
I have also tried storing every document as XML and parsing it, but that
is also too slow.
I have tried storing every document as varchar, but that is also slow
when using regexes to extract the data.

Many attempts have failed because 8GB of RAM and 10GB of swap were not
enough. Also, sometimes I get that more than 2^32 operations were
performed, and the functions stopped working.

I just wanted to ask whether someone knows how to speed this up.

Thanks in advance
--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

It might help a little to know a few more details, like what is in the
XML doc. When you say convert to relational, do you mean multiple
tables (and no more XML tags), or do you mean a table with a blob column
that contains some XML fragment?

I have imported millions of rows and never run out of memory. The
database will take care of itself unless you are doing something really
bad. I'd guess it's the XML parser running out of RAM and not the
database. Are you using DOM or SAX?

You say it took too much time. What did? The XML parsing? The
database inserts? Were you CPU-bound or I/O-bound?

What tools are you using to write this in? What OS are you on? What
version of PG? You know... just a "few" more details :-)

-Andy
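
To illustrate Andy's DOM-versus-SAX point: a DOM parser holds the entire document in RAM, while a streaming parser handles one record at a time, so memory use stays flat no matter how large the file is. A minimal sketch with Python's iterparse and psycopg2; the <entry> tag, its child fields, and the target table are all hypothetical:

    import xml.etree.ElementTree as ET
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
    cur = conn.cursor()

    context = ET.iterparse("huge.xml", events=("start", "end"))
    _, root = next(context)                  # keep the root so finished subtrees can be freed

    for event, elem in context:
        if event == "end" and elem.tag == "entry":   # hypothetical record element
            cur.execute(
                "INSERT INTO entries (name, seq) VALUES (%s, %s)",
                (elem.findtext("name"), elem.findtext("sequence")),   # hypothetical child tags
            )
            root.clear()                     # drop parsed subtrees so memory stays flat

    conn.commit()
    conn.close()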

#4 Ozz Nixon
ozznixon@gmail.com
In reply to: Mike Christensen (#2)
DATA Location

Is it possible (and how) to implement a data path on another partition (Linux) for an existing system? And then, if I do not gain anything, merge it into the production /data path?

Scenario of what I want to achieve (/mnt/data is already running)

/mnt/data resides on an NFS share
Contains over 2 Billion web sites crawled (yeah another search-engine site)

want to add:

/opt/data resides on internal drive
Will contain keyword hash system

Then, if I find this does not improve anything - or runs tight (running on an IBM BladeCenter with 76GB internals - so I may be limited) - can I simply shut down postgres, scp /opt/data/folder/ to the NFS share, bring postgres back up, fix any conf files, and have everything on the multi-terabyte array?

Checking the pulse of all you speed freaks who just hit the floor because this is on NFS, not iSCSI -- don't worry, when this site goes live it will be iSCSI. We have been in private development for 3 years... and the NFS still outruns the Internet connections.

#5 Raymond O'Donnell
rod@iol.ie
In reply to: Ozz Nixon (#4)
Re: DATA Location

On 28/12/2010 14:56, Ozz Nixon wrote:

Is it possible (and how) to implement a data path on another
partition (Linux) for an existing system? And then, if I do not gain
anything, merge it into the production /data path?

I think tablespaces will do what you want:

http://www.postgresql.org/docs/9.0/static/manage-ag-tablespaces.html

HTH,

Ray.

--
Raymond O'Donnell :: Galway :: Ireland
rod@iol.ie
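
For reference, the tablespace route Ray points at looks roughly like this. This is only a sketch, run as a superuser, with made-up names, paths, and columns (driven here through psycopg2):

    import psycopg2

    conn = psycopg2.connect("dbname=crawler")   # hypothetical database
    conn.autocommit = True                      # CREATE TABLESPACE cannot run inside a transaction block
    cur = conn.cursor()

    # The directory must already exist, be empty, and be owned by the postgres OS user.
    cur.execute("CREATE TABLESPACE keyword_space LOCATION '/opt/data/pg_keyword'")

    # New objects can be placed on the internal drive explicitly...
    cur.execute("""
        CREATE TABLE keyword_hash (
            keyword text PRIMARY KEY,
            doc_ids bigint[]
        ) TABLESPACE keyword_space
    """)

    # ...and moved onto the NFS-backed array later if needed:
    #   ALTER TABLE keyword_hash SET TABLESPACE pg_default;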

#6 Andreas Kretschmer
akretschmer@spamfence.net
In reply to: Ozz Nixon (#4)
Re: DATA Location

Ozz Nixon <ozznixon@gmail.com> wrote:

Is it possible (and how) to implement a data path on another partition
(Linux) for an existing system? And then, if I do not gain anything,
merge it into the production /data path?

I think you should read our docs about tablespaces:

http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°

#7 Scott Marlowe
scott.marlowe@gmail.com
In reply to: Ozz Nixon (#4)
Re: DATA Location

On Tue, Dec 28, 2010 at 7:56 AM, Ozz Nixon <ozznixon@gmail.com> wrote:

Is it possible (and how) to implement a data path on another partition (Linux) for an existing system? And then, if I do not gain anything, merge it into the production /data path?

Scenario of what I want to achieve (/mnt/data is already running)

       /mnt/data       resides on an NFS share
                               Contains over 2 Billion web sites crawled (yeah another search-engine site)

want to add:

       /opt/data               resides on internal drive
                               Will contain keyword hash system

Then, if I find this does not improve anything - or runs tight (running on an IBM BladeCenter with 76GB internals - so I may be limited) - can I simply shut down postgres, scp /opt/data/folder/ to the NFS share, bring postgres back up, fix any conf files, and have everything on the multi-terabyte array?

You can't just merge two pgsql databases into one directory. You have
to pick one to run, then dump the other and load it into that db
with pg_dump / psql / pg_restore. If you completely replace the db
instance in /opt/data/folder with the other one, that will work, but
it erases the original db that was there.
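
As a concrete illustration of the dump-and-reload path Scott describes (the ports, database names, and dump path are all made up; this is a sketch, not a recipe):

    import subprocess

    # Dump the database from the secondary instance in pg_dump's custom format...
    subprocess.run(
        ["pg_dump", "--format=custom", "--port=5433",
         "--file=/tmp/keywords.dump", "keywords"],      # hypothetical second instance and database
        check=True,
    )

    # ...then load it into the main cluster; the source database is left untouched.
    subprocess.run(
        ["pg_restore", "--port=5432", "--dbname=crawler", "/tmp/keywords.dump"],
        check=True,
    )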

Checking the pulse of all you speed freaks who just hit the floor because this is on NFS, not iSCSI -- don't worry, when this site goes live it will be iSCSI. We have been in private development for 3 years... and the NFS still outruns the Internet connections.

It's more an issue of reliability in case of crashes and power losses
than anything else. Most iSCSI implementations will honor fsync
properly, but many NFS mounts will lose data / corrupt your data store
if one or the other machine crashes in the wrong way.

#8 Jasen Betts
jasen@xnet.co.nz
In reply to: Ozz Nixon (#4)
Re: DATA Location

On 2010-12-28, Ozz Nixon <ozznixon@gmail.com> wrote:

Is it possible (and how) to implement a data path on another partition (Linux) for an existing system? And then, if I do not gain anything, merge it into the production /data path?

tablespaces

Scenario of what I want to achieve (/mnt/data is already running)

/mnt/data resides on an NFS share
Contains over 2 Billion web sites crawled (yeah another search-engine site)

using postgres on nfs is not recommended.

want to add:

/opt/data resides on internal drive
Will contain keyword hash system

local is good.

Then, if I find this does not improve anything - or runs tight
(running on an IBM BladeCenter with 76GB internals - so I may be
limited) - can I simply shut down postgres, scp /opt/data/folder/
to the NFS share, bring postgres back up, fix any conf files, and have
everything on the multi-terabyte array?

or leave postgres running and just copy the table using "select * into"
or similar.

it's possibly better to use iSCSI instead of NFS; it's not like you can share the data
directory.

--
⚂⚃ 100% natural
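
A sketch of the copy-while-running route Jasen mentions above, again with made-up names. CREATE TABLE ... AS is the recommended spelling of SELECT ... INTO and lets the target tablespace be named explicitly:

    import psycopg2

    conn = psycopg2.connect("dbname=crawler")   # hypothetical database
    cur = conn.cursor()

    # Copy the keyword table into a new table on the default (NFS-backed)
    # tablespace while the server stays up.
    cur.execute("""
        CREATE TABLE keyword_hash_nfs
        TABLESPACE pg_default
        AS SELECT * FROM keyword_hash
    """)
    conn.commit()
    conn.close()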