Let file_fdw access COPY FROM PROGRAM
A while back, there was a push to make COPY gzip-aware. That didn't happen,
but COPY FROM PROGRAM did, and it scratches the same itch.
I have a similar need, but with file_fdw foreign tables. I have .csv.gz
files downloaded to the server, but those CSVs have 100+ columns in them,
and in this case I only really care about a half dozen of those columns.
I'd like to avoid:
- the overhead of writing the uncompressed file to disk and then
immediately re-reading it
- writing unwanted columns to a temp/work table via COPY, and then
immediately re-reading them
- multicorn fdw because it ends up making a python string out of all data
cells
- a csv parsing tool like csvtool or mlr, because they output another CSV
which must be reparsed from scratch
Since file_fdw leverages COPY, it seemed like it would be easy to add the
FROM PROGRAM feature to file_fdw. I began asking questions on #postgresql
IRC, only to discover that Adam Gomaa ( akgomaa@gmail.com ) had already
written such a thing, but hadn't submitted it. Attached is a small rework
of his patch, along with documentation.
NOTE: The regression test includes unix commands in the program option. I
figured that wouldn't work for win32 systems, so I checked to see what the
regression tests do to test COPY FROM PROGRAM...and I couldn't find any. So
I guess the test exists as a proof of concept that will get excised before
final commit.
Attachments:
file_fdw_program_option-2016-06-02-v2.difftext/plain; charset=US-ASCII; name=file_fdw_program_option-2016-06-02-v2.diffDownload+143-44
On 3 June 2016 at 04:48, Corey Huinker <corey.huinker@gmail.com> wrote:
A while back, there was a push to make COPY gzip-aware. That didn't
happen, but COPY FROM PROGRAM did, and it scratches the same itch.
- writing unwanted columns to a temp/work table via COPY, and then
immediately re-reading them
Without wanting to take away from the value of letting file FDW access FROM
PROGRAM, I think this really merits a solution that applies to COPY as
well. Variants on "how do I COPY just some columns from a CSV" is a real
FAQ, and it doesn't seem like it'd be excessively hard to support. We'd
just need some way to pass a list of column-ordinals or header-names.
Not asking you to do that work, just pointing out that this particular
issue applies to COPY its self as well.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 3, 2016 at 1:06 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 3 June 2016 at 04:48, Corey Huinker <corey.huinker@gmail.com> wrote:
A while back, there was a push to make COPY gzip-aware. That didn't
happen, but COPY FROM PROGRAM did, and it scratches the same itch.- writing unwanted columns to a temp/work table via COPY, and then
immediately re-reading themWithout wanting to take away from the value of letting file FDW access
FROM PROGRAM, I think this really merits a solution that applies to COPY as
well. Variants on "how do I COPY just some columns from a CSV" is a real
FAQ, and it doesn't seem like it'd be excessively hard to support. We'd
just need some way to pass a list of column-ordinals or header-names.Not asking you to do that work, just pointing out that this particular
issue applies to COPY its self as well.
I agree, they are two different but slightly overlapping issues. file_fdw
needs a way to handle compressed/filtered files, and COPY needs the ability
to skip columns. But right now COPY is all about getting the shape of the
input from the shape of the destination table.
I had the bright idea of creating a custom datatype, call it "skip" /
"filler" / "nada" / "devnull" or something like that, that would map any
and all values to NULL, thus allowing COPY to naively "store" the unwanted
values into nothingness.
That idea falls apart when it hits InputFunctionCall() in fmgr.c, which
prevents any text string from coercing into NULL. I didn't think I was up
to the task of making InputFunctionCall respect a special case type (void?)
or somehow promoting the pseudo type void into a legit column type (unknown
side-effects there). I did create a type that coerced all input into the
empty string, but the disk usage of that was still pretty significant.
So maybe if there was a subspecies of COPY that returned SETOF RECORD:
SELECT important_column1, important_column2
FROM copy_srf( program := 'zcat filename', format := 'csv', header := true
) AS t(bogus_col1 text, important_column1 integer, important_column2 date,
bogus_col2 text, ... );
That would allow COPY to keep it's current efficiency and simplicity, while
(hopefully) avoiding the unwanted data from ever hitting disk. It would
also be vaguely reminiscent of SQL*Loader.
On Thu, Jun 2, 2016 at 4:48 PM, Corey Huinker <corey.huinker@gmail.com> wrote:
A while back, there was a push to make COPY gzip-aware. That didn't happen,
but COPY FROM PROGRAM did, and it scratches the same itch.I have a similar need, but with file_fdw foreign tables. I have .csv.gz
files downloaded to the server, but those CSVs have 100+ columns in them,
and in this case I only really care about a half dozen of those columns. I'd
like to avoid:
- the overhead of writing the uncompressed file to disk and then immediately
re-reading it
- writing unwanted columns to a temp/work table via COPY, and then
immediately re-reading them
- multicorn fdw because it ends up making a python string out of all data
cells
- a csv parsing tool like csvtool or mlr, because they output another CSV
which must be reparsed from scratchSince file_fdw leverages COPY, it seemed like it would be easy to add the
FROM PROGRAM feature to file_fdw. I began asking questions on #postgresql
IRC, only to discover that Adam Gomaa ( akgomaa@gmail.com ) had already
written such a thing, but hadn't submitted it. Attached is a small rework of
his patch, along with documentation.
His failure to submit that here himself raises the question of whether
he is OK with that code being released under the PostgreSQL license.
If this patch is going to be considered, I think we should have a post
from him clarifying that matter.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I'm fine with the code being released under the PostgreSQL license.
Thanks,
Adam
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi Corey,
Here are some comments and a review of the patch.
On 2016/06/03 5:48, Corey Huinker wrote:
A while back, there was a push to make COPY gzip-aware. That didn't happen,
but COPY FROM PROGRAM did, and it scratches the same itch.I have a similar need, but with file_fdw foreign tables. I have .csv.gz
files downloaded to the server, but those CSVs have 100+ columns in them,
and in this case I only really care about a half dozen of those columns.
I'd like to avoid:
- the overhead of writing the uncompressed file to disk and then
immediately re-reading it
- writing unwanted columns to a temp/work table via COPY, and then
immediately re-reading them
- multicorn fdw because it ends up making a python string out of all data
cells
- a csv parsing tool like csvtool or mlr, because they output another CSV
which must be reparsed from scratch
This feature seems desirable to me too.
Since file_fdw leverages COPY, it seemed like it would be easy to add the
FROM PROGRAM feature to file_fdw. I began asking questions on #postgresql
IRC, only to discover that Adam Gomaa ( akgomaa@gmail.com ) had already
written such a thing, but hadn't submitted it. Attached is a small rework
of his patch, along with documentation.
Yeah, the fact that file_fdw leverages COPY makes including this feature a
breeze. Various concerns such as security, popen() vs execve() were
addressed when COPY FROM PROGRAM/STDIN/STDOUT feature was proposed [1]/messages/by-id/002101cd9190$081c4140$1854c3c0$@lab.ntt.co.jp, so
we will not have to go through that again (hopefully).
NOTE: The regression test includes unix commands in the program option. I
figured that wouldn't work for win32 systems, so I checked to see what the
regression tests do to test COPY FROM PROGRAM...and I couldn't find any. So
I guess the test exists as a proof of concept that will get excised before
final commit.
I am not familiar with win32 stuff too, so I don't have much to say about
that. Maybe someone else can chime in to help with that.
About the patch:
* applies cleanly, compiles fine
* basic functionality seems to work (have not done any extensive tests though)
- Specifies the file to be read. Required. Must be an absolute path
name.
+ Specifies the file to be read. Must be an absolute path name.
+ Either <literal>filename</literal> or <literal>program</literal> must be
+ specified. They are mutually exclusive.
+ </para>
+ </listitem>
+ </varlistentry>
The note about filename and program being mutually exclusive could be
placed at the end of list of options. Or perhaps mention this note as
part of the description of program option, because filename is already
defined by that point.
+ <listitem>
+ <para>
+ Specifies the command to executed.
s/to executed/to be executed/g
+ Either <literal>program</literal> or <literal>filename</literal> must be
+ specified. They are mutually exclusive.
</para>
Oh I see this has already been mentioned in program option description.
Did you intend to specify the same in both filename and program descriptions?
@@ -97,8 +99,9 @@ typedef struct FileFdwPlanState
typedef struct FileFdwExecutionState
{
char *filename; /* file to read */
- List *options; /* merged COPY options, excluding filename */
- CopyState cstate; /* state of reading file */
+ char *program; /* program to read output from */
+ List *options; /* merged COPY options, excluding
filename and program */
+ CopyState cstate; /* state of reading file or program */
Have you considered a Boolean flag is_program instead of char **program
similar to how copy.c does it? (See a related comment further below)
- * This is because the filename is one of those options, and we don't
want
- * non-superusers to be able to determine which file gets read.
+ * This is because the filename or program string are two of those
+ * options, and we don't want non-superusers to be able to determine
which
+ * file gets read or what command is run.
I'm no native English speaker, but I wonder if this should read: filename
or program string *is* one of those options ... OR filename *and* program
are two of those options ... ? Also, "are two of those options" sounds a
bit odd to me because I don't see that used as often as "one of those
whatever". I guess that's quite a bit of nitpicking about a C comment, ;)
@@ -257,9 +264,26 @@ file_fdw_validator(PG_FUNCTION_ARGS)
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
errmsg("conflicting or redundant options")));
+ if (program)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
filename = defGetString(def);
}
+ else if (strcmp(def->defname, "program") == 0)
+ {
+ if (filename)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ if (program)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ program = defGetString(def);
+ }
Why does it check for filename here? Also the other way around above.
@@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int eflags)
* Create CopyState from FDW options. We always acquire all columns, so
* as to match the expected ScanTupleSlot signature.
*/
- cstate = BeginCopyFrom(node->ss.ss_currentRelation,
- filename,
- false,
- NIL,
- options);
+ if(filename)
+ cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+ filename,
+ false,
+ NIL,
+ options);
+ else
+ cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+ program,
+ true,
+ NIL,
+ options)
Like I mentioned above, if there was a is_program Boolean flag instead of
separate filename and program, this would be just:
+ cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+ filename,
+ is_program,
+ NIL,
+ options);
That would get rid of if-else blocks here and a couple of other places.
diff --git a/contrib/file_fdw/input/file_fdw.source
b/contrib/file_fdw/input/file_fdw.source
index 685561f..eae34a4 100644
--- a/contrib/file_fdw/input/file_fdw.source
+++ b/contrib/file_fdw/input/file_fdw.source
You forgot to include expected output diffs.
Thanks,
Amit
[1]: /messages/by-id/002101cd9190$081c4140$1854c3c0$@lab.ntt.co.jp
/messages/by-id/002101cd9190$081c4140$1854c3c0$@lab.ntt.co.jp
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 2, 2016 at 5:07 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:
I am not familiar with win32 stuff too, so I don't have much to say about
that. Maybe someone else can chime in to help with that.
The regressions basically *can't* test this because we'd need a shell
command we know works on any architecture.
About the patch:
* applies cleanly, compiles fine
* basic functionality seems to work (have not done any extensive tests
though)- Specifies the file to be read. Required. Must be an absolute path name. + Specifies the file to be read. Must be an absolute path name. + Either <literal>filename</literal> or <literal>program</literal> must be + specified. They are mutually exclusive. + </para> + </listitem> + </varlistentry>The note about filename and program being mutually exclusive could be
placed at the end of list of options. Or perhaps mention this note as
part of the description of program option, because filename is already
defined by that point.+ <listitem> + <para> + Specifies the command to executed.s/to executed/to be executed/g
Correct. I will fix that when other issues below are resolved.
+ Either <literal>program</literal> or <literal>filename</literal> must be + specified. They are mutually exclusive. </para>Oh I see this has already been mentioned in program option description.
Did you intend to specify the same in both filename and program
descriptions?
It's been a while since I repackaged Adam's code, but generally I'm in
favor of some redundancy if the two mutually exclusive things are
documented far enough apart.
@@ -97,8 +99,9 @@ typedef struct FileFdwPlanState typedef struct FileFdwExecutionState { char *filename; /* file to read */ - List *options; /* merged COPY options, excluding filename */ - CopyState cstate; /* state of reading file */ + char *program; /* program to read output from */ + List *options; /* merged COPY options, excluding filename and program */ + CopyState cstate; /* state of reading file or program */Have you considered a Boolean flag is_program instead of char **program
similar to how copy.c does it? (See a related comment further below)
Considered it yes, but decided against it when I started to write my
version. When Adam delivered his version of the patch, I noticed he had
made the same design decision. Only one of them will be initialized, and
the boolean will byte align to 4 bytes, so it's the same storage allocated
either way.
Either we're fine with two variables, or we think file_name is poorly
named. I have only a slight preference for the two variables, and defer to
the group for a preference.
- * This is because the filename is one of those options, and we don't want - * non-superusers to be able to determine which file gets read. + * This is because the filename or program string are two of those + * options, and we don't want non-superusers to be able to determine which + * file gets read or what command is run.I'm no native English speaker, but I wonder if this should read: filename
or program string *is* one of those options ... OR filename *and* program
are two of those options ... ? Also, "are two of those options" sounds a
bit odd to me because I don't see that used as often as "one of those
whatever". I guess that's quite a bit of nitpicking about a C comment, ;)
Given that C comments constitute a large portion of our documentation, I
fully support making them as clear as possible.
I don't remember if this is Adam's comment or mine. Adam - if you're out
there, chime in.
The original paragraph was:
Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read.
In principle non-superusers could be allowed to change the other options,
but that's not supported at present.
How does this paragraph sound?:
Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read,
or which program is run. In principle non-superusers could be allowed to
change the other options, but that's not supported at present.
@@ -257,9 +264,26 @@ file_fdw_validator(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("conflicting or redundant options"))); + if (program) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); filename = defGetString(def); }+ else if (strcmp(def->defname, "program") == 0) + { + if (filename) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + if (program) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + program = defGetString(def); + }Why does it check for filename here? Also the other way around above.
We don't want them to specify both program and filename, nor do we allow 2
programs, or 2 filenames.
@@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int eflags) * Create CopyState from FDW options. We always acquire all columns, so * as to match the expected ScanTupleSlot signature. */ - cstate = BeginCopyFrom(node->ss.ss_currentRelation, - filename, - false, - NIL, - options); + if(filename) + cstate = BeginCopyFrom(node->ss.ss_currentRelation, + filename, + false, + NIL, + options); + else + cstate = BeginCopyFrom(node->ss.ss_currentRelation, + program, + true, + NIL, + options)Like I mentioned above, if there was a is_program Boolean flag instead of
separate filename and program, this would be just:+ cstate = BeginCopyFrom(node->ss.ss_currentRelation, + filename, + is_program, + NIL, + options);That would get rid of if-else blocks here and a couple of other places.
It would, pushing the complexity out to the user. Which could be fine, but
if we do that then "filename" is a misnomer.
diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source index 685561f..eae34a4 100644 --- a/contrib/file_fdw/input/file_fdw.source +++ b/contrib/file_fdw/input/file_fdw.sourceYou forgot to include expected output diffs.
Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.
On 7 Sep. 2016 02:14, "Corey Huinker" <corey.huinker@gmail.com> wrote:
Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.
Your best bet will be using the TAP framework. There you can use Perl logic.
I'm not sure where to put such a test though. It doesn't really make sense
in src/test/recovery/ .
On Wed, Sep 7, 2016 at 7:53 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:
On 7 Sep. 2016 02:14, "Corey Huinker" <corey.huinker@gmail.com> wrote:
Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.Your best bet will be using the TAP framework. There you can use Perl logic.
I'm not sure where to put such a test though. It doesn't really make sense
in src/test/recovery/ .
There is nothing preventing the addition of a TAP test where there are
normal regression tests, so if you want a test for file_fdw you should
add it there, then change its Makefile to have the following target
rules:
check: prove-check
prove-check:
$(prove_check)
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016/09/07 3:12, Corey Huinker wrote:
On Fri, Sep 2, 2016 at 5:07 AM, Amit Langote wrote:
I am not familiar with win32 stuff too, so I don't have much to say about
that. Maybe someone else can chime in to help with that.The regressions basically *can't* test this because we'd need a shell
command we know works on any architecture.
OK.
@@ -97,8 +99,9 @@ typedef struct FileFdwPlanState typedef struct FileFdwExecutionState { char *filename; /* file to read */ - List *options; /* merged COPY options, excluding filename */ - CopyState cstate; /* state of reading file */ + char *program; /* program to read output from */ + List *options; /* merged COPY options, excluding filename and program */ + CopyState cstate; /* state of reading file or program */Have you considered a Boolean flag is_program instead of char **program
similar to how copy.c does it? (See a related comment further below)Considered it yes, but decided against it when I started to write my
version. When Adam delivered his version of the patch, I noticed he had
made the same design decision. Only one of them will be initialized, and
the boolean will byte align to 4 bytes, so it's the same storage allocated
either way.Either we're fine with two variables, or we think file_name is poorly
named. I have only a slight preference for the two variables, and defer to
the group for a preference.
OK, let's defer it to the committer.
- * This is because the filename is one of those options, and we don't want - * non-superusers to be able to determine which file gets read. + * This is because the filename or program string are two of those + * options, and we don't want non-superusers to be able to determine which + * file gets read or what command is run.I'm no native English speaker, but I wonder if this should read: filename
or program string *is* one of those options ... OR filename *and* program
are two of those options ... ? Also, "are two of those options" sounds a
bit odd to me because I don't see that used as often as "one of those
whatever". I guess that's quite a bit of nitpicking about a C comment, ;)Given that C comments constitute a large portion of our documentation, I
fully support making them as clear as possible.I don't remember if this is Adam's comment or mine. Adam - if you're out
there, chime in.The original paragraph was:
Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read.
In principle non-superusers could be allowed to change the other options,
but that's not supported at present.How does this paragraph sound?:
Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read,
or which program is run. In principle non-superusers could be allowed to
change the other options, but that's not supported at present.
Hmm, just a little modification would make it better for me:
... for security reasons. For example, only superuser should be able to
determine which file read or which program is run.
Although that could be just me.
@@ -257,9 +264,26 @@ file_fdw_validator(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("conflicting or redundant options"))); + if (program) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); filename = defGetString(def); }+ else if (strcmp(def->defname, "program") == 0) + { + if (filename) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + if (program) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + program = defGetString(def); + }Why does it check for filename here? Also the other way around above.
We don't want them to specify both program and filename, nor do we allow 2
programs, or 2 filenames.
Ah, I forgot about the mutually exclusive options part.
@@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int eflags) * Create CopyState from FDW options. We always acquire all columns, so * as to match the expected ScanTupleSlot signature. */ - cstate = BeginCopyFrom(node->ss.ss_currentRelation, - filename, - false, - NIL, - options); + if(filename) + cstate = BeginCopyFrom(node->ss.ss_currentRelation, + filename, + false, + NIL, + options); + else + cstate = BeginCopyFrom(node->ss.ss_currentRelation, + program, + true, + NIL, + options)Like I mentioned above, if there was a is_program Boolean flag instead of
separate filename and program, this would be just:+ cstate = BeginCopyFrom(node->ss.ss_currentRelation, + filename, + is_program, + NIL, + options);That would get rid of if-else blocks here and a couple of other places.
It would, pushing the complexity out to the user. Which could be fine, but
if we do that then "filename" is a misnomer.
This is an internal state so I'm not sure how this would be pushing
complexity out to the user. Am I perhaps misreading what you said?
What a user sees is that there are two separate foreign table options -
filename and program. That we handle them internally using a string to
identify either file or program and a Boolean flag to show which of the
two is just an internal implementation detail.
COPY does it that way internally and I just saw that psql's \copy does it
the same way too. In the latter's case, following is the options struct
(src/bin/psql/copy.c):
struct copy_options
{
char *before_tofrom; /* COPY string before TO/FROM */
char *after_tofrom; /* COPY string after TO/FROM filename */
char *file; /* NULL = stdin/stdout */
bool program; /* is 'file' a program to popen? */
bool psql_inout; /* true = use psql stdin/stdout */
bool from; /* true = FROM, false = TO */
};
But as you said above, this could be deferred to the committer.
diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source index 685561f..eae34a4 100644 --- a/contrib/file_fdw/input/file_fdw.source +++ b/contrib/file_fdw/input/file_fdw.sourceYou forgot to include expected output diffs.
Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.
Craig and Michael elsewhere suggested something about adding TAP tests. If
that helps the situation, maybe you could.
I will mark the CF entry status to "Waiting on author" till you post a new
patch.
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
wrote:
On 7 Sep. 2016 02:14, "Corey Huinker" <corey.huinker@gmail.com> wrote:
Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.Your best bet will be using the TAP framework. There you can use Perl
logic.I'm not sure where to put such a test though. It doesn't really make sense
in src/test/recovery/ .
And the TAP test would detect the operating system and know to create an
FDW that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
test_data.csv' on windows, and 'type test_data.csv;1' on VMS?
On 7 September 2016 at 11:21, Corey Huinker <corey.huinker@gmail.com> wrote:
On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
And the TAP test would detect the operating system and know to create an FDW
that has the PROGRAM value 'cat test_data.csv' on Unix, 'type test_data.csv'
on windows, and 'type test_data.csv;1' on VMS?
Right. Or just "perl emit_test_data.pl" that works for all of them,
since TAP is perl so you can safely assume you have Perl.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 6, 2016 at 9:46 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:
On 2016/09/07 3:12, Corey Huinker wrote:
On Fri, Sep 2, 2016 at 5:07 AM, Amit Langote wrote:
I am not familiar with win32 stuff too, so I don't have much to say
about
that. Maybe someone else can chime in to help with that.
The regressions basically *can't* test this because we'd need a shell
command we know works on any architecture.OK.
Well...maybe not, depending on what Craig and other can do to educate me
about the TAP tests.
Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read,
or which program is run. In principle non-superusers could be allowed to
change the other options, but that's not supported at present.Hmm, just a little modification would make it better for me:
... for security reasons. For example, only superuser should be able to
determine which file read or which program is run.Although that could be just me.
In this case "determine" is unclear whether a non-superuser can set the
program to be run, or is capable of knowing which program is set to be run
by the fdw.
We may want some more opinions on what is the most clear.
@@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int
eflags)
* Create CopyState from FDW options. We always acquire allcolumns,
so * as to match the expected ScanTupleSlot signature. */ - cstate = BeginCopyFrom(node->ss.ss_currentRelation, - filename, - false, - NIL, - options); + if(filename) + cstate = BeginCopyFrom(node->ss.ss_currentRelation, + filename, + false, + NIL, + options); + else + cstate = BeginCopyFrom(node->ss.ss_currentRelation, + program, + true, + NIL, + options)Like I mentioned above, if there was a is_program Boolean flag instead
of
separate filename and program, this would be just:
+ cstate = BeginCopyFrom(node->ss.ss_currentRelation, + filename, + is_program, + NIL, + options);That would get rid of if-else blocks here and a couple of other places.
It would, pushing the complexity out to the user. Which could be fine,
but
if we do that then "filename" is a misnomer.
This is an internal state so I'm not sure how this would be pushing
complexity out to the user. Am I perhaps misreading what you said?
Indeed, it is internal state. Maybe we rename the variable file_command or
something.
What a user sees is that there are two separate foreign table options -
filename and program. That we handle them internally using a string to
identify either file or program and a Boolean flag to show which of the
two is just an internal implementation detail.COPY does it that way internally and I just saw that psql's \copy does it
the same way too. In the latter's case, following is the options struct
(src/bin/psql/copy.c):struct copy_options
{
char *before_tofrom; /* COPY string before TO/FROM */
char *after_tofrom; /* COPY string after TO/FROM filename */
char *file; /* NULL = stdin/stdout */
bool program; /* is 'file' a program to popen? */
bool psql_inout; /* true = use psql stdin/stdout */
bool from; /* true = FROM, false = TO */
};But as you said above, this could be deferred to the committer.
Yeah, and that made for zero storage savings: a char pointer which is never
assigned a string takes up as much space as a 4-byte-aligned boolean. And
the result is that "file" really means program, which I found slightly
awkward.
diff --git a/contrib/file_fdw/input/file_fdw.source b/contrib/file_fdw/input/file_fdw.source index 685561f..eae34a4 100644 --- a/contrib/file_fdw/input/file_fdw.source +++ b/contrib/file_fdw/input/file_fdw.sourceYou forgot to include expected output diffs.
Having regression tests for this is extremely problematic, because the
program invoked would need to be an invokable command on any architecture
supported by postgres. I'm pretty sure no such command exists.Craig and Michael elsewhere suggested something about adding TAP tests. If
that helps the situation, maybe you could.
Yeah, I need to educate myself about those.
I will mark the CF entry status to "Waiting on author" till you post a new
patch.
Thanks.
Show quoted text
On Tue, Sep 6, 2016 at 11:24 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
wrote:
On 7 September 2016 at 11:21, Corey Huinker <corey.huinker@gmail.com>
wrote:On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer <
craig.ringer@2ndquadrant.com>
And the TAP test would detect the operating system and know to create an
FDW
that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
test_data.csv'
on windows, and 'type test_data.csv;1' on VMS?
Right. Or just "perl emit_test_data.pl" that works for all of them,
since TAP is perl so you can safely assume you have Perl.
Thanks. I was mentally locked in more basic OS commands. Am I right in
thinking perl is about the *only* OS command you can be sure is on every
architecture?
The platforms page says we support S/390 but no mention of VM/MVS/CMS. Did
we do an OS/400 port yet? ::ducks::
On 7 September 2016 at 11:37, Corey Huinker <corey.huinker@gmail.com> wrote:
On Tue, Sep 6, 2016 at 11:24 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
wrote:On 7 September 2016 at 11:21, Corey Huinker <corey.huinker@gmail.com>
wrote:On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer
<craig.ringer@2ndquadrant.com>And the TAP test would detect the operating system and know to create an
FDW
that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
test_data.csv'
on windows, and 'type test_data.csv;1' on VMS?Right. Or just "perl emit_test_data.pl" that works for all of them,
since TAP is perl so you can safely assume you have Perl.Thanks. I was mentally locked in more basic OS commands. Am I right in
thinking perl is about the *only* OS command you can be sure is on every
architecture?
Probably, there's a lot of crazy out there.
TAP tests can be conditionally run based on architecture, but
something like this is probably worth testing as widely as possible.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016/09/07 12:29, Corey Huinker wrote:
On Tue, Sep 6, 2016 at 9:46 PM, Amit Langote wrote:
OK.
Well...maybe not, depending on what Craig and other can do to educate me
about the TAP tests.
Sure.
Changing table-level options requires superuser privileges, for security
reasons: only a superuser should be able to determine which file is read,
or which program is run. In principle non-superusers could be allowed to
change the other options, but that's not supported at present.Hmm, just a little modification would make it better for me:
... for security reasons. For example, only superuser should be able to
determine which file read or which program is run.Although that could be just me.
In this case "determine" is unclear whether a non-superuser can set the
program to be run, or is capable of knowing which program is set to be run
by the fdw.
Hmm, it is indeed unclear.
How about:
... for security reasons. For example, only superuser should be able to
*change* which file is read or which program is run.
I just realized this is not just about a C comment. There is a line in
documentation as well which needs an update. Any conclusion here should
be applied there.
We may want some more opinions on what is the most clear.
Certainly.
But as you said above, this could be deferred to the committer.
Yeah, and that made for zero storage savings: a char pointer which is never
assigned a string takes up as much space as a 4-byte-aligned boolean. And
the result is that "file" really means program, which I found slightly
awkward.
My only intent to push for that approach is to have consistency with other
code implementing a similar feature although it may not be that important.
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 6, 2016 at 11:44 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
wrote:
On 7 September 2016 at 11:37, Corey Huinker <corey.huinker@gmail.com>
wrote:On Tue, Sep 6, 2016 at 11:24 PM, Craig Ringer <
craig.ringer@2ndquadrant.com>
wrote:
On 7 September 2016 at 11:21, Corey Huinker <corey.huinker@gmail.com>
wrote:On Tue, Sep 6, 2016 at 6:53 PM, Craig Ringer
<craig.ringer@2ndquadrant.com>And the TAP test would detect the operating system and know to create
an
FDW
that has the PROGRAM value 'cat test_data.csv' on Unix, 'type
test_data.csv'
on windows, and 'type test_data.csv;1' on VMS?Right. Or just "perl emit_test_data.pl" that works for all of them,
since TAP is perl so you can safely assume you have Perl.Thanks. I was mentally locked in more basic OS commands. Am I right in
thinking perl is about the *only* OS command you can be sure is on every
architecture?Probably, there's a lot of crazy out there.
TAP tests can be conditionally run based on architecture, but
something like this is probably worth testing as widely as possible.--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Stylistically, would a separate .pl file for the emitter be preferable to
something inline like
perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'
?
I'm inclined to go inline to cut down on the number of moving parts, but I
can see where perl's readability is a barrier to some, and either way I
want to follow established patterns.
[*] For those who don't perl, the command prints:
a b cc 4
b c dd 5
On 9 Sep. 2016 03:45, "Corey Huinker" <corey.huinker@gmail.com> wrote:
Stylistically, would a separate .pl file for the emitter be preferable to
something inline like
perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'
I'd be fine with that and a suitable comment. Just be careful with
different platforms' shell escaping rules.
On Thu, Sep 8, 2016 at 6:59 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
wrote:
On 9 Sep. 2016 03:45, "Corey Huinker" <corey.huinker@gmail.com> wrote:
Stylistically, would a separate .pl file for the emitter be preferable
to something inline like
perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'
I'd be fine with that and a suitable comment. Just be careful with
different platforms' shell escaping rules.
Do perl command switches on windows/VMS use /e instead of -e? If so,
that'd be a great argument doing just "perl filename".
V2 of this patch:
Changes:
* rebased to most recent master
* removed non-tap test that assumed the existence of Unix sed program
* added non-tap test that assumes the existence of perl
* switched from filename/program to filename/is_program to more closely
follow patterns in copy.c
* slight wording change in C comments
On Thu, Sep 8, 2016 at 6:59 PM, Craig Ringer <craig.ringer@2ndquadrant.com>
wrote:
Show quoted text
On 9 Sep. 2016 03:45, "Corey Huinker" <corey.huinker@gmail.com> wrote:
Stylistically, would a separate .pl file for the emitter be preferable
to something inline like
perl -e 'print "a\tb\tcc\t4\n"; print "b\tc\tdd\t5\n"'
I'd be fine with that and a suitable comment. Just be careful with
different platforms' shell escaping rules.