WIP patch: add (PRE|POST)PROCESSOR options to COPY

Started by Etsuro Fujitaover 13 years ago45 messageshackers

fujita.etsuro@lab.ntt.co.jp

over 13 years ago

I'd like to add the following options to the SQL COPY command and the psql \copy
instruction:

* PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
from an input file is preprocessed by the program before the data is loaded into
a postgres table.
* POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
from a postgres table is postprocessed by the program before the data is stored
in an output file.

These options can be specified only when an input or output file is specified.

These options allow to move data between postgres tables and e.g., compressed
files or files on a distributed file system such as Hadoop HDFS. Former
examples are shown below:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

postgres=# COPY foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
preprocessor '/home/pgsql/decompress.sh');

$ echo '/bin/gzip > $1' > compress.sh
$ chmod +x compress.sh

postgres=# COPY bar TO '/home/pgsql/bar.csv.gz' WITH (format 'csv',
postprocessor '/home/pgsql/compress.sh');

Attached is a WIP patch. Comments and questions are welcome.

(By using these options, I think it's also possible to develop a variant of
file_fdw, for example a compressed file wrapper and Hadoop HDFS wrapper.)

Thanks,

Best regards,
Etsuro Fujita

Alvaro Herrera

alvherre@2ndquadrant.com

over 13 years ago

In reply to: Etsuro Fujita (#1)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

Excerpts from Etsuro Fujita's message of jue sep 13 06:13:26 -0300 2012:

I'd like to add the following options to the SQL COPY command and the psql \copy
instruction:

* PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
from an input file is preprocessed by the program before the data is loaded into
a postgres table.
* POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
from a postgres table is postprocessed by the program before the data is stored
in an output file.

External programs? I don't like the sound of that; there all kinds of
open questions, security concerns, and process management problems.
What if the pre- and postprocessors were functions instead?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Pavel Stehule

pavel.stehule@gmail.com

over 13 years ago

In reply to: Alvaro Herrera (#2)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

2012/9/13 Alvaro Herrera <alvherre@2ndquadrant.com>:

Excerpts from Etsuro Fujita's message of jue sep 13 06:13:26 -0300 2012:

I'd like to add the following options to the SQL COPY command and the psql \copy
instruction:

* PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
from an input file is preprocessed by the program before the data is loaded into
a postgres table.
* POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
from a postgres table is postprocessed by the program before the data is stored
in an output file.

External programs? I don't like the sound of that; there all kinds of
open questions, security concerns, and process management problems.
What if the pre- and postprocessors were functions instead?

this can be solved via pipe and outer processes

Pavel

Show quoted text

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Etsuro Fujita (#1)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

I'd like to add the following options to the SQL COPY command and the psql \copy
instruction:

* PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
from an input file is preprocessed by the program before the data is loaded into
a postgres table.
* POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
from a postgres table is postprocessed by the program before the data is stored
in an output file.

The proposed patch causes the external processor programs to execute
with the privileges of the database server, which seems like a pretty
horrid idea. At the very least this would imply limiting use of the
feature to superusers, which greatly restricts its use-case.

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

regards, tom lane

Dimitri Fontaine

dimitri@2ndQuadrant.fr

over 13 years ago

In reply to: Tom Lane (#4)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

Tom Lane <tgl@sss.pgh.pa.us> writes:

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

And pgloader, which already has a part of that feature with the per
column reformating facility.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

Andrew Dunstan

andrew@dunslane.net

over 13 years ago

In reply to: Dimitri Fontaine (#5)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

On 09/13/2012 01:20 PM, Dimitri Fontaine wrote:

Tom Lane <tgl@sss.pgh.pa.us> writes:

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

And pgloader, which already has a part of that feature with the per
column reformating facility.

Yeah, I'd be inclined to say that pre/post processing of this kind is
really a job for specialized clients.

cheers

andrew

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 13 years ago

In reply to: Andrew Dunstan (#6)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

OK I will redesign the function.

Thanks everyone for the advice!

Best regards,
Etsuro Fujita

Show quoted text

-----Original Message-----
From: Andrew Dunstan [mailto:andrew@dunslane.net]
Sent: Friday, September 14, 2012 2:27 AM
To: Dimitri Fontaine
Cc: Tom Lane; Etsuro Fujita; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] WIP patch: add (PRE|POST)PROCESSOR options to COPY

On 09/13/2012 01:20 PM, Dimitri Fontaine wrote:

Tom Lane <tgl@sss.pgh.pa.us> writes:

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

And pgloader, which already has a part of that feature with the per
column reformating facility.

Yeah, I'd be inclined to say that pre/post processing of this kind is
really a job for specialized clients.

cheers

andrew

Craig Ringer

craig@2ndquadrant.com

over 13 years ago

In reply to: Tom Lane (#4)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

On 09/13/2012 10:25 PM, Tom Lane wrote:

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

Either that, or allow the pre- and post- processors to be programs
written in a (possibly trusted) PL.

I can't say I really see the point though, when it's easy to just filter
the csv through a pipeline.

--
Craig Ringer

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 13 years ago

In reply to: Craig Ringer (#8)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

From: Craig Ringer [mailto:ringerc@ringerc.id.au]

On 09/13/2012 10:25 PM, Tom Lane wrote:

I think it would be a lot better if this were designed so that the
processor programs executed on client side. Which would probably make
it not a COPY patch at all, but something in psql.

Maybe my explanation was insufficient. Let me add one thing to my earlier
explanation. The submitted patch allows the psql \copy instruction to be
executed like:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

postgres=# \copy foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
preprocessor '/home/pgsql/decompress.sh')

In this example, command "/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz" is
executed on client side, by using popen(), and command "COPY foo FROM STDIN WITH
(format 'csv')" is sent to backend. I apologize for not providing you with
enough explanation.

Either that, or allow the pre- and post- processors to be programs
written in a (possibly trusted) PL.

I would like to add the hooks not only for the psql \copy instrucntion, but also
for the SQL COPY command, because I think the hooks for the SQL COPY command
would allow to realize variants of contrib/file_fdw such as compressed file FDW
and Hadoop HDFS FDW, etc., which I think would be useful especially for DWH
environments.

Thanks,

Best regards,
Etsuro Fujita

#10

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Etsuro Fujita (#9)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

Maybe my explanation was insufficient. Let me add one thing to my earlier
explanation. The submitted patch allows the psql \copy instruction to be
executed like:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

postgres=# \copy foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
preprocessor '/home/pgsql/decompress.sh')

In this example, command "/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz" is
executed on client side, by using popen(), and command "COPY foo FROM STDIN WITH
(format 'csv')" is sent to backend. I apologize for not providing you with
enough explanation.

Well, in that case, you've got not only an explanation problem but a
syntax problem, because that syntax is utterly misleading. Anybody
looking at it would think that the "format" option is one of the options
being sent to the backend. The code required to pull it out of there
has got to be grossly overcomplicated (and likely bugprone), too.

I think it would be better to present this as something like

\copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format 'csv'

which would cue any reasonably Unix-savvy person that what's happening
is a popen on the client side. It'd probably be a whole lot less
complicated to implement, too.

regards, tom lane

#11

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 13 years ago

In reply to: Tom Lane (#10)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

Maybe my explanation was insufficient. Let me add one thing to my earlier
explanation. The submitted patch allows the psql \copy instruction to be
executed like:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

postgres=# \copy foo FROM '/home/pgsql/foo.csv.gz' WITH (format 'csv',
preprocessor '/home/pgsql/decompress.sh')

Well, in that case, you've got not only an explanation problem but a
syntax problem, because that syntax is utterly misleading. Anybody
looking at it would think that the "format" option is one of the options
being sent to the backend. The code required to pull it out of there
has got to be grossly overcomplicated (and likely bugprone), too.

I think it would be better to present this as something like

\copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
format 'csv'

which would cue any reasonably Unix-savvy person that what's happening
is a popen on the client side. It'd probably be a whole lot less
complicated to implement, too.

Great!

I have a question. I think it would be also better to extend the syntax for the
SQL COPY command in the same way, ie,

COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format
'csv'

Is this okay?

Thanks,

Best regards,
Etsuro Fujita

#12

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Etsuro Fujita (#11)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
I think it would be better to present this as something like
\copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
format 'csv'

I have a question. I think it would be also better to extend the syntax for the
SQL COPY command in the same way, ie,
COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format
'csv'

Yeah, sure --- that case is already superuser-only, so why not give it
the option of being a popen instead of just fopen. My objection was
only that it sounded like you were providing *only* the ability to run
the external processors on the server side. Providing popen on both
sides seems completely sensible.

regards, tom lane

#13

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Tom Lane (#12)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

I wrote:

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

I have a question. I think it would be also better to extend the syntax for the
SQL COPY command in the same way, ie,
COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with format
'csv'

Yeah, sure --- that case is already superuser-only, so why not give it
the option of being a popen instead of just fopen.

BTW, one thought that comes to mind is that such an operation is
extremely likely to fail under environments such as SELinux. That's
not necessarily a reason not to do it, but we should be wary of
promising that it will work everywhere. Probably a documentation note
about this would be enough.

regards, tom lane

#14

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 13 years ago

In reply to: Tom Lane (#13)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

I wrote:

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

I have a question. I think it would be also better to extend the syntax
for the SQL COPY command in the same way, ie,
COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
format 'csv'

Yeah, sure --- that case is already superuser-only, so why not give it
the option of being a popen instead of just fopen.

BTW, one thought that comes to mind is that such an operation is
extremely likely to fail under environments such as SELinux. That's
not necessarily a reason not to do it, but we should be wary of
promising that it will work everywhere. Probably a documentation note
about this would be enough.

OK I'll revise the patch.

Thank you for your advice!

Best regards,
Etsuro Fujita

#15

Etsuro Fujita

fujita.etsuro@lab.ntt.co.jp

over 13 years ago

In reply to: Etsuro Fujita (#14)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

I wrote:

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

I wrote:

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

I have a question. I think it would be also better to extend the syntax
for the SQL COPY command in the same way, ie,
COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
format 'csv'

Yeah, sure --- that case is already superuser-only, so why not give it
the option of being a popen instead of just fopen.

BTW, one thought that comes to mind is that such an operation is
extremely likely to fail under environments such as SELinux. That's
not necessarily a reason not to do it, but we should be wary of
promising that it will work everywhere. Probably a documentation note
about this would be enough.

OK I'll revise the patch.

I've revised the patch. In this version a user can specify hooks for pre- and
post-processor executables for COPY and \copy in the follwoing way:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

In the case of the COPY command,

postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
WITH (format 'csv');

Also, in the case of the \copy instruction,

postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
with (format 'csv')

As shown in the example above, I've assumed that the syntax for this option for
e.g., the COPY command is:

COPY table_name FROM 'progname filename' WITH ...
COPY table_name TO 'progname filename' WITH ...

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and that writes on standard output. Also, prgoname for COPY OUT is
the user-supplied program that reads standard input and writes to filename taken
as its argument. This makes simple the identification and verification of
progname and filename.

Todo:
* Documentation including documentation note about the limitation for
environments such as SELinux mentioned by Tom.
* More test

Any comments and suggestions are welcomed.

Thanks,

Best regards,
Etsuro Fujita

#16

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Etsuro Fujita (#1)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

On 13 September 2012 10:13, Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp> wrote:

I'd like to add the following options to the SQL COPY command and the psql \copy
instruction:

* PREPROCESSOR: Specifies the user-supplied program for COPY IN. The data
from an input file is preprocessed by the program before the data is loaded into
a postgres table.
* POSTPROCESSOR: Specifies the user-supplied program for COPY OUT. The data
from a postgres table is postprocessed by the program before the data is stored
in an output file.

These options can be specified only when an input or output file is specified.

These options allow to move data between postgres tables and e.g., compressed
files or files on a distributed file system such as Hadoop HDFS.

These options look pretty strange to me and I'm not sure they are a good idea.

If we want to read other/complex data, we have Foreign Data Wrappers.

What I think we need is COPY FROM (SELECT....). COPY (query) TO
already exists, so this is just the same thing in the other direction.
Once we have a SELECT statement in both directions we can add any user
defined transforms we wish implemented as database functions.

At present we only support INSERT SELECT ... FROM FDW
which means all the optimisations we've put into COPY are useless with
FDWs. So we need a way to speed up loads from other data sources.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Etsuro Fujita (#15)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

On Wed, Nov 14, 2012 at 8:30 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:

I wrote:

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

I wrote:

"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> writes:

I have a question. I think it would be also better to extend the syntax
for the SQL COPY command in the same way, ie,
COPY foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz |' with
format 'csv'

Yeah, sure --- that case is already superuser-only, so why not give it
the option of being a popen instead of just fopen.

BTW, one thought that comes to mind is that such an operation is
extremely likely to fail under environments such as SELinux. That's
not necessarily a reason not to do it, but we should be wary of
promising that it will work everywhere. Probably a documentation note
about this would be enough.

OK I'll revise the patch.

I've revised the patch. In this version a user can specify hooks for pre- and
post-processor executables for COPY and \copy in the follwoing way:

$ echo '/bin/gunzip -c $1' > decompress.sh
$ chmod +x decompress.sh

In the case of the COPY command,

postgres=# COPY foo FROM '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
WITH (format 'csv');

Also, in the case of the \copy instruction,

postgres=# \copy foo from '/home/pgsql/decompress.sh /home/pgsql/foo.csv.gz'
with (format 'csv')

As shown in the example above, I've assumed that the syntax for this option for
e.g., the COPY command is:

COPY table_name FROM 'progname filename' WITH ...
COPY table_name TO 'progname filename' WITH ...

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and that writes on standard output.

What about further extending the COPY IN syntax to the following?

COPY table_name FROM 'progname [ option, ... ]' WITH ...

I'd just like to execute

COPY vmstat_table FROM 'vmstat' WITH ...

Also, prgoname for COPY OUT is
the user-supplied program that reads standard input and writes to filename taken
as its argument. This makes simple the identification and verification of
progname and filename.

Todo:
* Documentation including documentation note about the limitation for
environments such as SELinux mentioned by Tom.
* More test

Any comments and suggestions are welcomed.

Isn't it dangerous to allow a user to execute external program in
server side via SQL?

Regards,

--
Fujii Masao

#18

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Fujii Masao (#17)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

On 14 November 2012 15:09, Fujii Masao <masao.fujii@gmail.com> wrote:

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and that writes on standard output.

What about further extending the COPY IN syntax to the following?

COPY table_name FROM 'progname [ option, ... ]' WITH ...

I'd just like to execute

COPY vmstat_table FROM 'vmstat' WITH ...

I think we should be using FDWs/SRFs here, not inventing new
syntax/architectures for executing external code, so -1 from me.

We can already do
INSERT table SELECT * FROM fdw;
with any logic for generating data lives inside an FDW or SRF.

If we want it in COPY we can have syntax like this...
COPY table FROM (SELECT * FROM fdw)

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Simon Riggs (#18)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

On Thu, Nov 15, 2012 at 12:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 14 November 2012 15:09, Fujii Masao <masao.fujii@gmail.com> wrote:

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and that writes on standard output.

What about further extending the COPY IN syntax to the following?

COPY table_name FROM 'progname [ option, ... ]' WITH ...

I'd just like to execute

COPY vmstat_table FROM 'vmstat' WITH ...

I think we should be using FDWs/SRFs here, not inventing new
syntax/architectures for executing external code, so -1 from me.

We can already do
INSERT table SELECT * FROM fdw;
with any logic for generating data lives inside an FDW or SRF.

If we want it in COPY we can have syntax like this...
COPY table FROM (SELECT * FROM fdw)

New syntax looks attractive to me because it's easy to use that.
It's not easy to implement the FDW for the external program which
a user wants to execute.

Of course if someone implements something like any_external_program_fdw,
I would change my mind..

Regards,

--
Fujii Masao

#20

Alvaro Herrera

alvherre@2ndquadrant.com

over 13 years ago

In reply to: Simon Riggs (#18)

Re: WIP patch: add (PRE|POST)PROCESSOR options to COPY

Simon Riggs escribió:

On 14 November 2012 15:09, Fujii Masao <masao.fujii@gmail.com> wrote:

Here, progname for COPY IN is the user-supplied program that takes filename as
its argument and that writes on standard output.

What about further extending the COPY IN syntax to the following?

COPY table_name FROM 'progname [ option, ... ]' WITH ...

I'd just like to execute

COPY vmstat_table FROM 'vmstat' WITH ...

I think we should be using FDWs/SRFs here, not inventing new
syntax/architectures for executing external code, so -1 from me.

Hmm, but then you are forced to write C code, whereas the "external
program" proposal could have you writing a only shell script instead.
So there is some merit to this idea ... though we could have a
"pipe_fdw" that could let you specify an arbitrary program to run.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Alvaro Herrera (#20)

#22

Andrew Dunstan

andrew@dunslane.net

over 13 years ago

In reply to: Tom Lane (#21)

#23