Make COPY format extendable: Extract COPY TO format implementations
Hi,
I want to work on making COPY format extendable. I attach
the first patch for it. I'll send more patches after this is
merged.
Background:
Currently, COPY TO/FROM supports only "text", "csv" and
"binary" formats. There are some requests to support more
COPY formats. For example:
* 2023-11: JSON and JSON lines [1]/messages/by-id/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1@joeconway.com
* 2022-04: Apache Arrow [2]/messages/by-id/CAGrfaBVyfm0wPzXVqm0=h5uArYh9N_ij+sVpUtDHqkB=VyB3jw@mail.gmail.com
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]/messages/by-id/20180210151304.fonjztsynewldfba@gmail.com
(FYI: I want to add support for Apache Arrow.)
There were discussions how to add support for more formats. [3]/messages/by-id/20180210151304.fonjztsynewldfba@gmail.com[4]/messages/by-id/3741749.1655952719@sss.pgh.pa.us
In these discussions, we got a consensus about making COPY
format extendable.
But it seems that nobody works on this yet. So I want to
work on this. (If there is anyone who wants to work on this
together, I'm happy.)
Summary:
The attached patch introduces CopyToFormatOps struct that is
similar to TupleTableSlotOps for TupleTableSlot but
CopyToFormatOps is for COPY TO format. CopyToFormatOps has
routines to implement a COPY TO format.
The attached patch doesn't change:
* the current behavior (all existing tests are still passed
without changing them)
* the existing "text", "csv" and "binary" format output
implementations including local variable names (the
attached patch just move them and adjust indent)
* performance (no significant loss of performance)
In other words, this is just a refactoring for further
changes to make COPY format extendable. If I use "complete
the task and then request reviews for it" approach, it will
be difficult to review because changes for it will be
large. So I want to work on this step by step. Is it
acceptable?
TODOs that should be done in subsequent patches:
* Add some CopyToState readers such as CopyToStateGetDest(),
CopyToStateGetAttnums() and CopyToStateGetOpts()
(We will need to consider which APIs should be exported.)
(This is for implemeing COPY TO format by extension.)
* Export CopySend*() in src/backend/commands/copyto.c
(This is for implemeing COPY TO format by extension.)
* Add API to register a new COPY TO format implementation
* Add "CREATE XXX" to register a new COPY TO format (or COPY
TO/FROM format) implementation
("CREATE COPY HANDLER" was suggested in [5]/messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de.)
* Same for COPY FROM
Performance:
We got a consensus about making COPY format extendable but
we should care about performance. [6]/messages/by-id/3741749.1655952719@sss.pgh.pa.us
I think that step 1 ought to be to convert the existing
formats into plug-ins, and demonstrate that there's no
significant loss of performance.
So I measured COPY TO time with/without this change. You can
see there is no significant loss of performance.
Data: Random 32 bit integers:
CREATE TABLE data (int32 integer);
INSERT INTO data
SELECT random() * 10000
FROM generate_series(1, ${n_records});
The number of records: 100K, 1M and 10M
100K without this change:
format,elapsed time (ms)
text,22.527
csv,23.822
binary,24.806
100K with this change:
format,elapsed time (ms)
text,22.919
csv,24.643
binary,24.705
1M without this change:
format,elapsed time (ms)
text,223.457
csv,233.583
binary,242.687
1M with this change:
format,elapsed time (ms)
text,224.591
csv,233.964
binary,247.164
10M without this change:
format,elapsed time (ms)
text,2330.383
csv,2411.394
binary,2590.817
10M with this change:
format,elapsed time (ms)
text,2231.307
csv,2408.067
binary,2473.617
[1]: /messages/by-id/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1@joeconway.com
[2]: /messages/by-id/CAGrfaBVyfm0wPzXVqm0=h5uArYh9N_ij+sVpUtDHqkB=VyB3jw@mail.gmail.com
[3]: /messages/by-id/20180210151304.fonjztsynewldfba@gmail.com
[4]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us
[5]: /messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de
[6]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us
Thanks,
--
kou
Attachments:
v1-0001-Extract-COPY-TO-format-implementations.patchtext/x-patch; charset=us-asciiDownload+269-157
On Mon, Dec 04, 2023 at 03:35:48PM +0900, Sutou Kouhei wrote:
I want to work on making COPY format extendable. I attach
the first patch for it. I'll send more patches after this is
merged.
Given the current discussion about adding JSON, I think this could be a
nice bit of refactoring that could ultimately open the door to providing
other COPY formats via shared libraries.
In other words, this is just a refactoring for further
changes to make COPY format extendable. If I use "complete
the task and then request reviews for it" approach, it will
be difficult to review because changes for it will be
large. So I want to work on this step by step. Is it
acceptable?
I think it makes sense to do this part independently, but we should be
careful to design this with the follow-up tasks in mind.
So I measured COPY TO time with/without this change. You can
see there is no significant loss of performance.Data: Random 32 bit integers:
CREATE TABLE data (int32 integer);
INSERT INTO data
SELECT random() * 10000
FROM generate_series(1, ${n_records});
Seems encouraging. I assume the performance concerns stem from the use of
function pointers. Or was there something else?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
Thanks for replying to this proposal!
In <20231205182458.GC2757816@nathanxps13>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 5 Dec 2023 12:24:58 -0600,
Nathan Bossart <nathandbossart@gmail.com> wrote:
I think it makes sense to do this part independently, but we should be
careful to design this with the follow-up tasks in mind.
OK. I'll keep updating the "TODOs" section in the original
e-mail. It also includes design in the follow-up tasks. We
can discuss the design separately from the patches
submitting. (The current submitted patch just focuses on
refactoring but we can discuss the final design.)
I assume the performance concerns stem from the use of
function pointers. Or was there something else?
I think so too.
The original e-mail that mentioned the performance concern
[1]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us
pointers might be concerned.
If the currently supported formats ("text", "csv" and
"binary") are implemented as an extension, it may have more
concerns but we will keep them as built-in formats for
compatibility. So I think that no more concerns exist for
these formats.
[1]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us
Thanks,
--
kou
On Wed, Dec 6, 2023 at 10:45 AM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,
Thanks for replying to this proposal!
In <20231205182458.GC2757816@nathanxps13>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Tue, 5 Dec 2023 12:24:58 -0600,
Nathan Bossart <nathandbossart@gmail.com> wrote:I think it makes sense to do this part independently, but we should be
careful to design this with the follow-up tasks in mind.OK. I'll keep updating the "TODOs" section in the original
e-mail. It also includes design in the follow-up tasks. We
can discuss the design separately from the patches
submitting. (The current submitted patch just focuses on
refactoring but we can discuss the final design.)I assume the performance concerns stem from the use of
function pointers. Or was there something else?I think so too.
The original e-mail that mentioned the performance concern
[1] didn't say about the reason but the use of function
pointers might be concerned.If the currently supported formats ("text", "csv" and
"binary") are implemented as an extension, it may have more
concerns but we will keep them as built-in formats for
compatibility. So I think that no more concerns exist for
these formats.
For the modern formats(parquet, orc, avro, etc.), will they be
implemented as extensions or in core?
The patch looks good except for a pair of extra curly braces.
[1]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us
Thanks,
--
kou
--
Regards
Junwang Zhao
Hi,
In <CAEG8a3Jf7kPV3ez5OHu-pFGscKfVyd9KkubMF199etkfz=EPRg@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 11:18:35 +0800,
Junwang Zhao <zhjwpku@gmail.com> wrote:
For the modern formats(parquet, orc, avro, etc.), will they be
implemented as extensions or in core?
I think that they should be implemented as extensions
because they will depend of external libraries and may not
use C. For example, C++ will be used for Apache Parquet
because the official Apache Parquet C++ implementation
exists but the C implementation doesn't.
(I can implement an extension for Apache Parquet after we
complete this feature. I'll implement an extension for
Apache Arrow with the official Apache Arrow C++
implementation. And it's easy that we convert Apache Arrow
data to Apache Parquet with the official Apache Parquet
implementation.)
The patch looks good except for a pair of extra curly braces.
Thanks for the review! I attach the v2 patch that removes
extra curly braces for "if (isnull)".
Thanks,
--
kou
Attachments:
v2-0001-Extract-COPY-TO-format-implementations.patchtext/x-patch; charset=us-asciiDownload+265-157
On Wed, Dec 6, 2023 at 2:19 PM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,
In <CAEG8a3Jf7kPV3ez5OHu-pFGscKfVyd9KkubMF199etkfz=EPRg@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 11:18:35 +0800,
Junwang Zhao <zhjwpku@gmail.com> wrote:For the modern formats(parquet, orc, avro, etc.), will they be
implemented as extensions or in core?I think that they should be implemented as extensions
because they will depend of external libraries and may not
use C. For example, C++ will be used for Apache Parquet
because the official Apache Parquet C++ implementation
exists but the C implementation doesn't.(I can implement an extension for Apache Parquet after we
complete this feature. I'll implement an extension for
Apache Arrow with the official Apache Arrow C++
implementation. And it's easy that we convert Apache Arrow
data to Apache Parquet with the official Apache Parquet
implementation.)The patch looks good except for a pair of extra curly braces.
Thanks for the review! I attach the v2 patch that removes
extra curly braces for "if (isnull)".
For the extra curly braces, I mean the following code block in
CopyToFormatBinaryStart:
+ { <-- I thought this is useless?
+ /* Generate header for a binary copy */
+ int32 tmp;
+
+ /* Signature */
+ CopySendData(cstate, BinarySignature, 11);
+ /* Flags field */
+ tmp = 0;
+ CopySendInt32(cstate, tmp);
+ /* No header extension */
+ tmp = 0;
+ CopySendInt32(cstate, tmp);
+ }
Thanks,
--
kou
--
Regards
Junwang Zhao
Hi,
In <CAEG8a3K9dE2gt3+K+h=DwTqMenR84aeYuYS+cty3SR3LAeDBAQ@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 15:11:34 +0800,
Junwang Zhao <zhjwpku@gmail.com> wrote:
For the extra curly braces, I mean the following code block in
CopyToFormatBinaryStart:+ { <-- I thought this is useless? + /* Generate header for a binary copy */ + int32 tmp; + + /* Signature */ + CopySendData(cstate, BinarySignature, 11); + /* Flags field */ + tmp = 0; + CopySendInt32(cstate, tmp); + /* No header extension */ + tmp = 0; + CopySendInt32(cstate, tmp); + }
Oh, I see. I've removed and attach the v3 patch. In general,
I don't change variable name and so on in this patch. I just
move codes in this patch. But I also removed the "tmp"
variable for this case because I think that the name isn't
suitable for larger scope. (I think that "tmp" is acceptable
in a small scope like the above code.)
New code:
/* Generate header for a binary copy */
/* Signature */
CopySendData(cstate, BinarySignature, 11);
/* Flags field */
CopySendInt32(cstate, 0);
/* No header extension */
CopySendInt32(cstate, 0);
Thanks,
--
kou
Attachments:
v3-0001-Extract-COPY-TO-format-implementations.patchtext/x-patch; charset=us-asciiDownload+259-157
Sutou Kouhei wrote:
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3](FYI: I want to add support for Apache Arrow.)
There were discussions how to add support for more formats. [3][4]
In these discussions, we got a consensus about making COPY
format extendable.
These formats seem all column-oriented whereas COPY is row-oriented
at the protocol level [1]https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY.
With regard to the procotol, how would it work to support these formats?
[1]: https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Wed, Dec 6, 2023 at 3:28 PM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,
In <CAEG8a3K9dE2gt3+K+h=DwTqMenR84aeYuYS+cty3SR3LAeDBAQ@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 15:11:34 +0800,
Junwang Zhao <zhjwpku@gmail.com> wrote:For the extra curly braces, I mean the following code block in
CopyToFormatBinaryStart:+ { <-- I thought this is useless? + /* Generate header for a binary copy */ + int32 tmp; + + /* Signature */ + CopySendData(cstate, BinarySignature, 11); + /* Flags field */ + tmp = 0; + CopySendInt32(cstate, tmp); + /* No header extension */ + tmp = 0; + CopySendInt32(cstate, tmp); + }Oh, I see. I've removed and attach the v3 patch. In general,
I don't change variable name and so on in this patch. I just
move codes in this patch. But I also removed the "tmp"
variable for this case because I think that the name isn't
suitable for larger scope. (I think that "tmp" is acceptable
in a small scope like the above code.)New code:
/* Generate header for a binary copy */
/* Signature */
CopySendData(cstate, BinarySignature, 11);
/* Flags field */
CopySendInt32(cstate, 0);
/* No header extension */
CopySendInt32(cstate, 0);Thanks,
--
kou
Hi Kou,
I read the thread[1]/messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de -- Regards Junwang Zhao you posted and I think Andres's suggestion sounds great.
Should we extract both *copy to* and *copy from* for the first step, in that
case we can add the pg_copy_handler catalog smoothly later.
Attached V4 adds 'extract copy from' and it passed the cirrus ci,
please take a look.
I added a hook *copy_from_end* but this might be removed later if not used.
[1]: /messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de -- Regards Junwang Zhao
--
Regards
Junwang Zhao
Attachments:
v4-0001-Extract-COPY-handlers.patchapplication/octet-stream; name=v4-0001-Extract-COPY-handlers.patchDownload+619-413
On Wed, Dec 6, 2023 at 8:32 PM Daniel Verite <daniel@manitou-mail.org> wrote:
Sutou Kouhei wrote:
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3](FYI: I want to add support for Apache Arrow.)
There were discussions how to add support for more formats. [3][4]
In these discussions, we got a consensus about making COPY
format extendable.These formats seem all column-oriented whereas COPY is row-oriented
at the protocol level [1].
With regard to the procotol, how would it work to support these formats?
They have kind of *RowGroup* concepts, a bunch of rows goes to a RowBatch
and the data of the same column goes together.
I think they should fit the COPY semantics and there are some FDW out there for
these modern formats, like [1]https://github.com/adjust/parquet_fdw. If we support COPY to deal with the
format, it will
be easier to interact with them(without creating
server/usermapping/foreign table).
[1]: https://github.com/adjust/parquet_fdw
[1] https://www.postgresql.org/docs/current/protocol-flow.html#PROTOCOL-COPY
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
--
Regards
Junwang Zhao
On Wed, Dec 06, 2023 at 10:07:51PM +0800, Junwang Zhao wrote:
I read the thread[1] you posted and I think Andres's suggestion sounds great.
Should we extract both *copy to* and *copy from* for the first step, in that
case we can add the pg_copy_handler catalog smoothly later.Attached V4 adds 'extract copy from' and it passed the cirrus ci,
please take a look.I added a hook *copy_from_end* but this might be removed later if not used.
[1]: /messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de
I was looking at the differences between v3 posted by Sutou-san and
v4 from you, seeing that:
+/* Routines for a COPY HANDLER implementation. */
+typedef struct CopyHandlerOps
{
/* Called when COPY TO is started. This will send a header. */
- void (*start) (CopyToState cstate, TupleDesc tupDesc);
+ void (*copy_to_start) (CopyToState cstate, TupleDesc tupDesc);
/* Copy one row for COPY TO. */
- void (*one_row) (CopyToState cstate, TupleTableSlot *slot);
+ void (*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot);
/* Called when COPY TO is ended. This will send a trailer. */
- void (*end) (CopyToState cstate);
-} CopyToFormatOps;
+ void (*copy_to_end) (CopyToState cstate);
+
+ void (*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc);
+ bool (*copy_from_next) (CopyFromState cstate, ExprContext *econtext,
+ Datum *values, bool *nulls);
+ void (*copy_from_error_callback) (CopyFromState cstate);
+ void (*copy_from_end) (CopyFromState cstate);
+} CopyHandlerOps;
And we've spent a good deal of time refactoring the copy code so as
the logic behind TO and FROM is split. Having a set of routines that
groups both does not look like a step in the right direction to me,
and v4 is an attempt at solving two problems, while v3 aims to improve
one case. It seems to me that each callback portion should be focused
on staying in its own area of the code, aka copyfrom*.c or copyto*.c.
--
Michael
Hi,
In <CAEG8a3LSRhK601Bn50u71BgfNWm4q3kv-o-KEq=hrbyLbY_EsA@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 22:07:51 +0800,
Junwang Zhao <zhjwpku@gmail.com> wrote:
Should we extract both *copy to* and *copy from* for the first step, in that
case we can add the pg_copy_handler catalog smoothly later.
I don't object it (mixing TO/FROM changes to one patch) but
it may make review difficult. Is it acceptable?
FYI: I planed that I implement TO part, and then FROM part,
and then unify TO/FROM parts if needed. [1]/messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com
Attached V4 adds 'extract copy from' and it passed the cirrus ci,
please take a look.
Thanks. Here are my comments:
+ /* + * Error is relevant to a particular line. + * + * If line_buf still contains the correct line, print it. + */ + if (cstate->line_buf_valid)
We need to fix the indentation.
+CopyFromFormatBinaryStart(CopyFromState cstate, TupleDesc tupDesc) +{ + FmgrInfo *in_functions; + Oid *typioparams; + Oid in_func_oid; + AttrNumber num_phys_attrs; + + /* + * Pick up the required catalog information for each attribute in the + * relation, including the input function, the element type (to pass to + * the input function), and info about defaults and constraints. (Which + * input function we use depends on text/binary format choice.) + */ + num_phys_attrs = tupDesc->natts; + in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo)); + typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid));
We need to update the comment because defaults and
constraints aren't picked up here.
+CopyFromFormatTextStart(CopyFromState cstate, TupleDesc tupDesc)
...
+ /* + * Pick up the required catalog information for each attribute in the + * relation, including the input function, the element type (to pass to + * the input function), and info about defaults and constraints. (Which + * input function we use depends on text/binary format choice.) + */ + in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo)); + typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid));
ditto.
@@ -1716,15 +1776,6 @@ BeginCopyFrom(ParseState *pstate,
ReceiveCopyBinaryHeader(cstate);
}
I think that this block should be moved to
CopyFromFormatBinaryStart() too. But we need to run it after
we setup inputs such as data_source_cb, pipe and filename...
+/* Routines for a COPY HANDLER implementation. */
+typedef struct CopyHandlerOps
+{
+ /* Called when COPY TO is started. This will send a header. */
+ void (*copy_to_start) (CopyToState cstate, TupleDesc tupDesc);
+
+ /* Copy one row for COPY TO. */
+ void (*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot);
+
+ /* Called when COPY TO is ended. This will send a trailer. */
+ void (*copy_to_end) (CopyToState cstate);
+
+ void (*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc);
+ bool (*copy_from_next) (CopyFromState cstate, ExprContext *econtext,
+ Datum *values, bool *nulls);
+ void (*copy_from_error_callback) (CopyFromState cstate);
+ void (*copy_from_end) (CopyFromState cstate);
+} CopyHandlerOps;
It seems that "copy_" prefix is redundant. Should we use
"to_start" instead of "copy_to_start" and so on?
BTW, it seems that "COPY FROM (FORMAT json)" may not be implemented. [2]/messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com
We may need to care about NULL copy_from_* cases.
I added a hook *copy_from_end* but this might be removed later if not used.
It may be useful to clean up resources for COPY FROM but the
patch doesn't call the copy_from_end. How about removing it
for now? We can add it and call it from EndCopyFrom() later?
Because it's not needed for now.
I think that we should focus on refactoring instead of
adding a new feature in this patch.
[1]: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com
[2]: /messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.com
Thanks,
--
kou
On Thu, Dec 7, 2023 at 8:39 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Dec 06, 2023 at 10:07:51PM +0800, Junwang Zhao wrote:
I read the thread[1] you posted and I think Andres's suggestion sounds great.
Should we extract both *copy to* and *copy from* for the first step, in that
case we can add the pg_copy_handler catalog smoothly later.Attached V4 adds 'extract copy from' and it passed the cirrus ci,
please take a look.I added a hook *copy_from_end* but this might be removed later if not used.
[1]: /messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de
I was looking at the differences between v3 posted by Sutou-san and
v4 from you, seeing that:+/* Routines for a COPY HANDLER implementation. */ +typedef struct CopyHandlerOps { /* Called when COPY TO is started. This will send a header. */ - void (*start) (CopyToState cstate, TupleDesc tupDesc); + void (*copy_to_start) (CopyToState cstate, TupleDesc tupDesc);/* Copy one row for COPY TO. */ - void (*one_row) (CopyToState cstate, TupleTableSlot *slot); + void (*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot);/* Called when COPY TO is ended. This will send a trailer. */ - void (*end) (CopyToState cstate); -} CopyToFormatOps; + void (*copy_to_end) (CopyToState cstate); + + void (*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc); + bool (*copy_from_next) (CopyFromState cstate, ExprContext *econtext, + Datum *values, bool *nulls); + void (*copy_from_error_callback) (CopyFromState cstate); + void (*copy_from_end) (CopyFromState cstate); +} CopyHandlerOps;And we've spent a good deal of time refactoring the copy code so as
the logic behind TO and FROM is split. Having a set of routines that
groups both does not look like a step in the right direction to me,
The point of this refactor (from my view) is to make it possible to add new
copy handlers in extensions, just like access method. As Andres suggested,
a system catalog like *pg_copy_handler*, if we split TO and FROM into two
sets of routines, does that mean we have to create two catalog(
pg_copy_from_handler and pg_copy_to_handler)?
and v4 is an attempt at solving two problems, while v3 aims to improve
one case. It seems to me that each callback portion should be focused
on staying in its own area of the code, aka copyfrom*.c or copyto*.c.
--
Michael
--
Regards
Junwang Zhao
On Thu, Dec 7, 2023 at 1:05 PM Sutou Kouhei <kou@clear-code.com> wrote:
Hi,
In <CAEG8a3LSRhK601Bn50u71BgfNWm4q3kv-o-KEq=hrbyLbY_EsA@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 6 Dec 2023 22:07:51 +0800,
Junwang Zhao <zhjwpku@gmail.com> wrote:Should we extract both *copy to* and *copy from* for the first step, in that
case we can add the pg_copy_handler catalog smoothly later.I don't object it (mixing TO/FROM changes to one patch) but
it may make review difficult. Is it acceptable?FYI: I planed that I implement TO part, and then FROM part,
and then unify TO/FROM parts if needed. [1]
I'm fine with step by step refactoring, let's just wait for more
suggestions.
Attached V4 adds 'extract copy from' and it passed the cirrus ci,
please take a look.Thanks. Here are my comments:
+ /* + * Error is relevant to a particular line. + * + * If line_buf still contains the correct line, print it. + */ + if (cstate->line_buf_valid)We need to fix the indentation.
+CopyFromFormatBinaryStart(CopyFromState cstate, TupleDesc tupDesc) +{ + FmgrInfo *in_functions; + Oid *typioparams; + Oid in_func_oid; + AttrNumber num_phys_attrs; + + /* + * Pick up the required catalog information for each attribute in the + * relation, including the input function, the element type (to pass to + * the input function), and info about defaults and constraints. (Which + * input function we use depends on text/binary format choice.) + */ + num_phys_attrs = tupDesc->natts; + in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo)); + typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid));We need to update the comment because defaults and
constraints aren't picked up here.+CopyFromFormatTextStart(CopyFromState cstate, TupleDesc tupDesc)
...
+ /* + * Pick up the required catalog information for each attribute in the + * relation, including the input function, the element type (to pass to + * the input function), and info about defaults and constraints. (Which + * input function we use depends on text/binary format choice.) + */ + in_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo)); + typioparams = (Oid *) palloc(num_phys_attrs * sizeof(Oid));ditto.
@@ -1716,15 +1776,6 @@ BeginCopyFrom(ParseState *pstate,
ReceiveCopyBinaryHeader(cstate);
}I think that this block should be moved to
CopyFromFormatBinaryStart() too. But we need to run it after
we setup inputs such as data_source_cb, pipe and filename...+/* Routines for a COPY HANDLER implementation. */ +typedef struct CopyHandlerOps +{ + /* Called when COPY TO is started. This will send a header. */ + void (*copy_to_start) (CopyToState cstate, TupleDesc tupDesc); + + /* Copy one row for COPY TO. */ + void (*copy_to_one_row) (CopyToState cstate, TupleTableSlot *slot); + + /* Called when COPY TO is ended. This will send a trailer. */ + void (*copy_to_end) (CopyToState cstate); + + void (*copy_from_start) (CopyFromState cstate, TupleDesc tupDesc); + bool (*copy_from_next) (CopyFromState cstate, ExprContext *econtext, + Datum *values, bool *nulls); + void (*copy_from_error_callback) (CopyFromState cstate); + void (*copy_from_end) (CopyFromState cstate); +} CopyHandlerOps;It seems that "copy_" prefix is redundant. Should we use
"to_start" instead of "copy_to_start" and so on?BTW, it seems that "COPY FROM (FORMAT json)" may not be implemented. [2]
We may need to care about NULL copy_from_* cases.I added a hook *copy_from_end* but this might be removed later if not used.
It may be useful to clean up resources for COPY FROM but the
patch doesn't call the copy_from_end. How about removing it
for now? We can add it and call it from EndCopyFrom() later?
Because it's not needed for now.I think that we should focus on refactoring instead of
adding a new feature in this patch.[1]: /messages/by-id/20231204.153548.2126325458835528809.kou@clear-code.com
[2]: /messages/by-id/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU=kcg@mail.gmail.comThanks,
--
kou
--
Regards
Junwang Zhao
On 2023-12-07 Th 03:37, Junwang Zhao wrote:
The point of this refactor (from my view) is to make it possible to add new
copy handlers in extensions, just like access method. As Andres suggested,
a system catalog like *pg_copy_handler*, if we split TO and FROM into two
sets of routines, does that mean we have to create two catalog(
pg_copy_from_handler and pg_copy_to_handler)?
Surely not. Either have two fields, one for the TO handler and one for
the FROM handler, or a flag on each row indicating if it's a FROM or TO
handler.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Fri, Dec 8, 2023 at 1:39 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2023-12-07 Th 03:37, Junwang Zhao wrote:
The point of this refactor (from my view) is to make it possible to add new
copy handlers in extensions, just like access method. As Andres suggested,
a system catalog like *pg_copy_handler*, if we split TO and FROM into two
sets of routines, does that mean we have to create two catalog(
pg_copy_from_handler and pg_copy_to_handler)?Surely not. Either have two fields, one for the TO handler and one for
the FROM handler, or a flag on each row indicating if it's a FROM or TO
handler.
True.
But why do we need a system catalog like pg_copy_handler in the first
place? I imagined that an extension can define a handler function
returning a set of callbacks and the parser can lookup the handler
function by name, like FDW and TABLESAMPLE.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 8, 2023 at 3:27 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 8, 2023 at 1:39 AM Andrew Dunstan <andrew@dunslane.net> wrote:
On 2023-12-07 Th 03:37, Junwang Zhao wrote:
The point of this refactor (from my view) is to make it possible to add new
copy handlers in extensions, just like access method. As Andres suggested,
a system catalog like *pg_copy_handler*, if we split TO and FROM into two
sets of routines, does that mean we have to create two catalog(
pg_copy_from_handler and pg_copy_to_handler)?Surely not. Either have two fields, one for the TO handler and one for
the FROM handler, or a flag on each row indicating if it's a FROM or TO
handler.
If we wrap the two fields into a single structure, that will still be in
copy.h, which I think is not necessary. A single routing wrapper should
be enough, the actual implementation still stays separate
copy_[to/from].c files.
True.
But why do we need a system catalog like pg_copy_handler in the first
place? I imagined that an extension can define a handler function
returning a set of callbacks and the parser can lookup the handler
function by name, like FDW and TABLESAMPLE.
I can see FDW related utility commands but no TABLESAMPLE related,
and there is a pg_foreign_data_wrapper system catalog which has
a *fdwhandler* field.
If we want extensions to create a new copy handler, I think
something like pg_copy_hander should be necessary.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
I go one step further to implement the pg_copy_handler, attached V5 is
the implementation with some changes suggested by Kou.
You can also review this on this github pull request [1]https://github.com/zhjwpku/postgres/pull/1/files.
[1]: https://github.com/zhjwpku/postgres/pull/1/files
--
Regards
Junwang Zhao
Attachments:
v5-0001-Extract-COPY-handlers.patchapplication/octet-stream; name=v5-0001-Extract-COPY-handlers.patchDownload+839-418
On Fri, Dec 08, 2023 at 10:32:27AM +0800, Junwang Zhao wrote:
I can see FDW related utility commands but no TABLESAMPLE related,
and there is a pg_foreign_data_wrapper system catalog which has
a *fdwhandler* field.
+ */ +CATALOG(pg_copy_handler,4551,CopyHandlerRelationId)
Using a catalog is an over-engineered design. Others have provided
hints about that upthread, but it would be enough to have one or two
handler types that are wrapped around one or two SQL *functions*, like
tablesamples. It seems like you've missed it, but feel free to read
about tablesample-method.sgml, that explains how this is achieved for
tablesamples.
If we want extensions to create a new copy handler, I think
something like pg_copy_hander should be necessary.
A catalog is not necessary, that's the point, because it can be
replaced by a scan of pg_proc with the function name defined in a COPY
query (be it through a FORMAT, or different option in a DefElem).
An example of extension with tablesamples is contrib/tsm_system_rows/,
that just uses a function returning a tsm_handler:
CREATE FUNCTION system_rows(internal)
RETURNS tsm_handler
AS 'MODULE_PATHNAME', 'tsm_system_rows_handler'
LANGUAGE C STRICT;
Then SELECT queries rely on the contents of the TABLESAMPLE clause to
find the set of callbacks it should use by calling the function.
+/* Routines for a COPY HANDLER implementation. */
+typedef struct CopyRoutine
+{
FWIW, I find weird the concept of having one handler for both COPY
FROM and COPY TO as each one of them has callbacks that are mutually
exclusive to the other, but I'm OK if there is a consensus of only
one. So I'd suggest to use *two* NodeTags instead for a cleaner
split, meaning that we'd need two functions for each method. My point
is that a custom COPY handler could just define a COPY TO handler or a
COPY FROM handler, though it mostly comes down to a matter of taste
regarding how clean the error handling becomes if one tries to use a
set of callbacks with a COPY type (TO or FROM) not matching it.
--
Michael
On Fri, Dec 8, 2023 at 2:17 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Dec 08, 2023 at 10:32:27AM +0800, Junwang Zhao wrote:
I can see FDW related utility commands but no TABLESAMPLE related,
and there is a pg_foreign_data_wrapper system catalog which has
a *fdwhandler* field.+ */ +CATALOG(pg_copy_handler,4551,CopyHandlerRelationId)
Using a catalog is an over-engineered design. Others have provided
hints about that upthread, but it would be enough to have one or two
handler types that are wrapped around one or two SQL *functions*, like
tablesamples. It seems like you've missed it, but feel free to read
about tablesample-method.sgml, that explains how this is achieved for
tablesamples.
Agreed. My previous example of FDW was not a good one, I missed something.
If we want extensions to create a new copy handler, I think
something like pg_copy_hander should be necessary.A catalog is not necessary, that's the point, because it can be
replaced by a scan of pg_proc with the function name defined in a COPY
query (be it through a FORMAT, or different option in a DefElem).
An example of extension with tablesamples is contrib/tsm_system_rows/,
that just uses a function returning a tsm_handler:
CREATE FUNCTION system_rows(internal)
RETURNS tsm_handler
AS 'MODULE_PATHNAME', 'tsm_system_rows_handler'
LANGUAGE C STRICT;Then SELECT queries rely on the contents of the TABLESAMPLE clause to
find the set of callbacks it should use by calling the function.+/* Routines for a COPY HANDLER implementation. */ +typedef struct CopyRoutine +{FWIW, I find weird the concept of having one handler for both COPY
FROM and COPY TO as each one of them has callbacks that are mutually
exclusive to the other, but I'm OK if there is a consensus of only
one. So I'd suggest to use *two* NodeTags instead for a cleaner
split, meaning that we'd need two functions for each method. My point
is that a custom COPY handler could just define a COPY TO handler or a
COPY FROM handler, though it mostly comes down to a matter of taste
regarding how clean the error handling becomes if one tries to use a
set of callbacks with a COPY type (TO or FROM) not matching it.
I tend to agree to have separate two functions for each method. But
given we implement it in tablesample-way, I think we need to make it
clear how to call one of the two functions depending on COPY TO and
FROM.
IIUC in tablesamples cases, we scan pg_proc to find the handler
function like system_rows(internal) by the method name specified in
the query. On the other hand, in COPY cases, the queries would be
going to be like:
COPY tab TO stdout WITH (format = 'arrow');
and
COPY tab FROM stdin WITH (format = 'arrow');
So a custom COPY extension would not be able to define SQL functions
just like arrow(internal) for example. We might need to define a rule
like the function returning copy_in/out_handler must be defined as
<method name>_to(internal) and <method_name>_from(internal).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 08, 2023 at 03:42:06PM +0900, Masahiko Sawada wrote:
So a custom COPY extension would not be able to define SQL functions
just like arrow(internal) for example. We might need to define a rule
like the function returning copy_in/out_handler must be defined as
<method name>_to(internal) and <method_name>_from(internal).
Yeah, I was wondering if there was a trick to avoid the input internal
argument conflict, but cannot recall something elegant on the top of
my mind. Anyway, I'd be OK with any approach as long as it plays
nicely with the query integration, and that's FORMAT's DefElem with
its string value to do the function lookups.
--
Michael