xml-file as foreign table?

Started by Johann Spies, almost 10 years ago, 4 messages, pgsql-general
#1Johann Spies
johann.spies@gmail.com

I have several large (7GB+) xml files to get into an SQL database.

The xml-files can contain up to 500 000 subrecords which I want to be able
to query in the database.

They are too large to do something like this:

insert into rawxml (xml) select XMLPARSE (DOCUMENT
CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml'), 'UTF8'));

If it were possible, each file would be one huge record in the table which
can then be unpacked using XPATH.

The options I am considering are:

1. Unpack the individual records (will be more than 50 million) using
something like python with lxml and psycopg2 and insert them after dropping
all indexes and triggers on the table

2. Unpack the individual records and write a (very) large tsv-file and then
insert it using 'copy'

It would be convenient if I could use the present xml files as 'foreign
tables' and parse them using the xpath-capabilities of PostgreSQL.

Is this possible?

If not, what would be the most efficient way of getting the individual
xml records into an xml-type field in a table?

The documentation did not give me a clear answer.

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)

#2Arjen Nienhuis
a.g.nienhuis@gmail.com
In reply to: Johann Spies (#1)
Re: xml-file as foreign table?

On Apr 28, 2016 14:33, "Johann Spies" <johann.spies@gmail.com> wrote:

I have several large (7GB+) xml files to get into an SQL database.

The xml-files can contain up to 500 000 subrecords which I want to be able
to query in the database.

They are too large to do something like this:

insert into rawxml (xml) select XMLPARSE (DOCUMENT
CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml'), 'UTF8'));

If it were possible, each file would be one huge record in the table
which can then be unpacked using XPATH.

The options I am considering are:

1. Unpack the individual records (will be more than 50 million) using
something like python with lxml and psycopg2 and insert them after dropping
all indexes and triggers on the table

2. Unpack the individual records and write a (very) large tsv-file and
then insert it using 'copy'

The fastest way I found is to combine these two. Using iterparse from lxml
combined with load_rows and COPY from py-postgresql:

http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements

That way you can stream the data.
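A minimal sketch of that streaming approach, using the stdlib's ElementTree as a stand-in for lxml (the sample document, tag names and row layout here are invented for illustration):

```python
# Sketch: iterparse emits one element at a time, so a 7 GB file never
# has to fit in memory. Each finished <REC> is turned into one
# TSV-encoded line, ready to stream into COPY.
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<records>
  <REC uid="A1"><title>first</title></REC>
  <REC uid="A2"><title>second</title></REC>
</records>"""

def iter_copy_rows(fileobj):
    """Yield one TSV-encoded line per <REC> element."""
    for event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "REC":
            elem.tail = None  # drop inter-record whitespace
            xml_text = ET.tostring(elem, encoding="unicode")
            # COPY text format needs embedded specials escaped
            xml_text = (xml_text.replace("\\", "\\\\")
                                .replace("\t", "\\t")
                                .replace("\n", "\\n"))
            yield "\t".join([elem.get("uid"), xml_text]) + "\n"
            elem.clear()  # free the element's memory as we go

rows = list(iter_copy_rows(io.BytesIO(SAMPLE)))
```

Each yielded line could then be fed to `COPY ... FROM STDIN` (e.g. via py-postgresql's `load_rows`), so the whole file is never materialised at once.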

It would be convenient if I could use the present xml files as 'foreign
tables' and parse them using the xpath-capabilities of PostgreSQL.

Is this possible?

There is a multicorn fdw for that:

https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py

But I never tried it. It looks like it loads all rows into a Python list.

Groeten, Arjen

#3Arjen Nienhuis
a.g.nienhuis@gmail.com
In reply to: Johann Spies (#1)
Re: xml-file as foreign table?

On Thu, May 5, 2016 at 2:13 PM, Johann Spies <johann.spies@gmail.com> wrote:

Dankie Arjen,

On 29 April 2016 at 07:01, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:

The options I am considering are:

1. Unpack the individual records (will be more than 50 million) using
something like python with lxml and psycopg2 and insert them after dropping
all indexes and triggers on the table

2. Unpack the individual records and write a (very) large tsv-file and
then insert it using 'copy'

The fastest way I found is to combine these two. Using iterparse from lxml
combined with load_rows and COPY from py-postgresql:

http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements

That way you can stream the data.

I did not know about py-postgresql as I am a Python-2.7 user.

I am excited about the possibilities py-postgresql is offering. After a
bit of experimenting with the streaming-copy option I ran into the
same problem that I had while using python2: PostgreSQL rejecting the
xml-part of the record.

How do you handle the conversion from string to bytes and back and the
presence of quotes within the xml?

I have tried this to experiment with just 10 records for a start:

ut = element.xpath('.//t:UID/text()', namespaces=namespaces)[0]
x = etree.tostring(element)
y = x.decode(encoding='UTF-8').replace("'", "''")
s = '\t'.join([gzipf, filename, ut, y])
t = s.encode(encoding='UTF-8')
rows.append(t)
count += 1
element.clear()
gc.collect()
if count == 10:
    import pdb; pdb.set_trace()
    xmlin.load_rows(rows)
    rows = []
    f.close()
    exit(0)

Which ends up with an error:

postgresql.exceptions.XMLContentError: invalid XML content
CODE: 2200N
LOCATION: File 'xml.c', line 1551, in xml_ereport from SERVER
CONTEXT: COPY annual, line 1, column xml: "<REC
xmlns="http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord"
r_id_disclaimer="R..."
DETAIL: line 1: Premature end of data in tag REC line 1

If I need to guess, it's because of \n in the xml. I used "WITH (FORMAT
csv)" and quote the fields:

def q(v):
    return b'"' + v.replace(b'"', b'""') + b'"'

return b','.join(q(f) for f in fields) + b'\n'
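Spelled out as a self-contained fragment (the record id and xml content below are invented), the quoting works like this:

```python
# With COPY ... WITH (FORMAT csv), an embedded newline is legal as long
# as the field is quoted and inner double quotes are doubled. With the
# default text format the same newline would end the row, which is what
# produces the "Premature end of data" error.
def q(v):
    return b'"' + v.replace(b'"', b'""') + b'"'

def csv_row(fields):
    return b','.join(q(f) for f in fields) + b'\n'

xml_field = b'<REC attr="x">line1\nline2</REC>'
row = csv_row([b'A1', xml_field])
```

The newline stays inside the quoted field, so the server-side CSV parser reassembles the value intact.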

In the end I also had some other problems with the XML (namespaces), so I used:

etree.tostring(element, method='c14n', exclusive=True)
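The namespace problem this solves: a <REC> cut out of the document would otherwise lose xmlns declarations made on an ancestor. A small stdlib sketch of the same idea (ElementTree's tostring re-declares the namespaces a subelement uses, much as lxml's exclusive c14n does; the namespace URI and tags are made up):

```python
# Extracting a namespaced subelement: the serialized fragment must carry
# its own xmlns declaration, or it will not parse on its own.
import xml.etree.ElementTree as ET

doc = ('<records xmlns="http://example.invalid/wok">'
       '<REC uid="A1"><title>t</title></REC>'
       '</records>')
rec = ET.fromstring(doc)[0]           # the inner <REC> element
fragment = ET.tostring(rec, encoding='unicode')
# The fragment re-declares the namespace, so it round-trips:
reparsed = ET.fromstring(fragment)
```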

With python2.7 I tried the same technique using subprocess and a call
to psql to pipe the data to PostgreSQL -- ending with the same error.

Maybe you can show a few lines of the output.

Dankie nogmaals vir die verwysing na python-postgresql.

Mooi dag.

Johann

Groeten, Arjen

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Johann Spies
johann.spies@gmail.com
In reply to: Arjen Nienhuis (#3)
Re: xml-file as foreign table?

Thanks Arjen,

def q(v):
    return b'"' + v.replace(b'"', b'""') + b'"'

return b','.join(q(f) for f in fields) + b'\n'

In the end I also had some other problems with the XML (namespaces), so I
used:

etree.tostring(element, method='c14n', exclusive=True)

This helped. My code is now doing its job.

Regards
Johann