Make COPY format extendable: Extract COPY TO format implementations

Started by Sutou Kouhei, over 2 years ago, 335 messages, hackers

kou@clear-code.com

over 2 years ago

Hi,

I want to work on making COPY format extendable. I attach
the first patch for it. I'll send more patches after this is
merged.

Background:

Currently, COPY TO/FROM supports only "text", "csv" and
"binary" formats. There are some requests to support more
COPY formats. For example:

* 2023-11: JSON and JSON lines [1]
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]

(FYI: I want to add support for Apache Arrow.)

There were discussions about how to add support for more
formats. [3][4] In these discussions, we reached a consensus
on making COPY format extendable.

But it seems that nobody is working on this yet. So I want to
work on this. (If there is anyone who wants to work on this
together, I'm happy.)

Summary:

The attached patch introduces CopyToFormatOps struct that is
similar to TupleTableSlotOps for TupleTableSlot but
CopyToFormatOps is for COPY TO format. CopyToFormatOps has
routines to implement a COPY TO format.
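
To illustrate the general shape of such a callback table, here is a minimal, standalone sketch. It is not the definition from the attached patch: DemoCopyToState, DemoCopyToFormatOps, the callback names (start, one_row, end) and all demo_* helpers are hypothetical names chosen for this example, and rows are simplified to arrays of integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the COPY TO state. */
typedef struct DemoCopyToState
{
	char		buf[256];		/* output collected so far */
	size_t		len;			/* bytes used in buf, excluding the NUL */
} DemoCopyToState;

/* Hypothetical callback table: one set of routines per output format. */
typedef struct DemoCopyToFormatOps
{
	void		(*start) (DemoCopyToState *state);
	void		(*one_row) (DemoCopyToState *state, const int *values, int nvalues);
	void		(*end) (DemoCopyToState *state);
} DemoCopyToFormatOps;

/* Append a string to the buffer; silently drop it if it does not fit. */
static void
demo_append(DemoCopyToState *state, const char *s)
{
	size_t		n = strlen(s);

	if (state->len + n < sizeof(state->buf))
	{
		memcpy(state->buf + state->len, s, n + 1);
		state->len += n;
	}
}

/* Used for formats that need no header or trailer. */
static void
demo_noop(DemoCopyToState *state)
{
	(void) state;
}

/* Shared helper: write one row with the given field delimiter. */
static void
demo_delimited_row(DemoCopyToState *state, const int *values, int nvalues,
				   const char *delim)
{
	char		field[16];
	int			i;

	for (i = 0; i < nvalues; i++)
	{
		if (i > 0)
			demo_append(state, delim);
		snprintf(field, sizeof(field), "%d", values[i]);
		demo_append(state, field);
	}
	demo_append(state, "\n");
}

static void
demo_text_one_row(DemoCopyToState *state, const int *values, int nvalues)
{
	demo_delimited_row(state, values, nvalues, "\t");
}

static void
demo_csv_one_row(DemoCopyToState *state, const int *values, int nvalues)
{
	demo_delimited_row(state, values, nvalues, ",");
}

static const DemoCopyToFormatOps demo_text_ops = {demo_noop, demo_text_one_row, demo_noop};
static const DemoCopyToFormatOps demo_csv_ops = {demo_noop, demo_csv_one_row, demo_noop};

/* Format-agnostic driver: it only ever calls through the ops table. */
static void
demo_copy_to(const DemoCopyToFormatOps *ops, DemoCopyToState *state,
			 const int rows[][2], int nrows)
{
	int			i;

	assert(ops != NULL && state != NULL);
	state->len = 0;
	state->buf[0] = '\0';
	ops->start(state);
	for (i = 0; i < nrows; i++)
		ops->one_row(state, rows[i], 2);
	ops->end(state);
}
```

In this sketch the driver never branches on the format name. Selecting demo_text_ops or demo_csv_ops is the only thing that changes the output, which is the property that would let a new format be added without touching the driver.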

The attached patch doesn't change:

* the current behavior (all existing tests still pass
without changing them)
* the existing "text", "csv" and "binary" format output
implementations including local variable names (the
attached patch just moves them and adjusts indentation)
* performance (no significant loss of performance)

In other words, this is just a refactoring for further
changes to make COPY format extendable. If I use the "complete
the task and then request reviews for it" approach, it will
be difficult to review because the changes will be
large. So I want to work on this step by step. Is it
acceptable?

TODOs that should be done in subsequent patches:

* Add some CopyToState readers such as CopyToStateGetDest(),
CopyToStateGetAttnums() and CopyToStateGetOpts()
(We will need to consider which APIs should be exported.)
(This is for implementing a COPY TO format in an extension.)
* Export CopySend*() in src/backend/commands/copyto.c
(This is for implementing a COPY TO format in an extension.)
* Add API to register a new COPY TO format implementation
* Add "CREATE XXX" to register a new COPY TO format (or COPY
TO/FROM format) implementation
("CREATE COPY HANDLER" was suggested in [5].)
* Same for COPY FROM
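
As a rough idea of what a registration API could look like, here is a minimal, standalone sketch of a name-to-callbacks registry. Nothing in it comes from the patch or from PostgreSQL: SketchCopyToOps, sketch_register_copy_to_format(), sketch_find_copy_to_format() and the fixed-size table are all hypothetical, and the actual design is still open.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_FORMATS 8

/* Hypothetical callback table; a real one would hold the format routines. */
typedef struct SketchCopyToOps
{
	const char *(*label) (void);	/* placeholder callback for the demo */
} SketchCopyToOps;

typedef struct SketchRegistryEntry
{
	const char *name;
	const SketchCopyToOps *ops;
} SketchRegistryEntry;

static SketchRegistryEntry sketch_registry[SKETCH_MAX_FORMATS];
static int	sketch_nformats = 0;

/* Look up a format by name; returns NULL if it was never registered. */
static const SketchCopyToOps *
sketch_find_copy_to_format(const char *name)
{
	int			i;

	assert(sketch_nformats <= SKETCH_MAX_FORMATS);
	for (i = 0; i < sketch_nformats; i++)
	{
		if (strcmp(sketch_registry[i].name, name) == 0)
			return sketch_registry[i].ops;
	}
	return NULL;
}

/*
 * Register a format under a unique name.
 * Returns 0 on success, -1 on bad arguments, a full table or a duplicate.
 */
static int
sketch_register_copy_to_format(const char *name, const SketchCopyToOps *ops)
{
	if (name == NULL || ops == NULL)
		return -1;
	if (sketch_nformats >= SKETCH_MAX_FORMATS)
		return -1;
	if (sketch_find_copy_to_format(name) != NULL)
		return -1;

	sketch_registry[sketch_nformats].name = name;
	sketch_registry[sketch_nformats].ops = ops;
	sketch_nformats++;
	return 0;
}

/* Example format that an extension might provide. */
static const char *
sketch_arrow_label(void)
{
	return "arrow-demo";
}

static const SketchCopyToOps sketch_arrow_ops = {sketch_arrow_label};
```

Under these assumptions, an extension would call the register function once at load time, and COPY would call the find function when it sees an unknown format name. Rejecting duplicate names keeps a second extension from silently replacing a format that is already registered.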

Performance:

We reached a consensus on making COPY format extendable but
we should care about performance. [6]

I think that step 1 ought to be to convert the existing
formats into plug-ins, and demonstrate that there's no
significant loss of performance.

So I measured COPY TO time with/without this change. You can
see there is no significant loss of performance.

Data: Random 32-bit integers:

CREATE TABLE data (int32 integer);
INSERT INTO data
SELECT random() * 10000
FROM generate_series(1, ${n_records});

The number of records: 100K, 1M and 10M

100K without this change:

format,elapsed time (ms)
text,22.527
csv,23.822
binary,24.806

100K with this change:

format,elapsed time (ms)
text,22.919
csv,24.643
binary,24.705

1M without this change:

format,elapsed time (ms)
text,223.457
csv,233.583
binary,242.687

1M with this change:

format,elapsed time (ms)
text,224.591
csv,233.964
binary,247.164

10M without this change:

format,elapsed time (ms)
text,2330.383
csv,2411.394
binary,2590.817

10M with this change:

format,elapsed time (ms)
text,2231.307
csv,2408.067
binary,2473.617

[1]: /messages/by-id/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1@joeconway.com
[2]: /messages/by-id/CAGrfaBVyfm0wPzXVqm0=h5uArYh9N_ij+sVpUtDHqkB=VyB3jw@mail.gmail.com
[3]: /messages/by-id/20180210151304.fonjztsynewldfba@gmail.com
[4]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us
[5]: /messages/by-id/20180211211235.5x3jywe5z3lkgcsr@alap3.anarazel.de
[6]: /messages/by-id/3741749.1655952719@sss.pgh.pa.us

Thanks,
--
kou

Nathan Bossart

nathandbossart@gmail.com

over 2 years ago

In reply to: Sutou Kouhei (#1)

Re: Make COPY format extendable: Extract COPY TO format implementations

On Mon, Dec 04, 2023 at 03:35:48PM +0900, Sutou Kouhei wrote:

I want to work on making COPY format extendable. I attach
the first patch for it. I'll send more patches after this is
merged.

Given the current discussion about adding JSON, I think this could be a
nice bit of refactoring that could ultimately open the door to providing
other COPY formats via shared libraries.

In other words, this is just a refactoring for further
changes to make COPY format extendable. If I use "complete
the task and then request reviews for it" approach, it will
be difficult to review because changes for it will be
large. So I want to work on this step by step. Is it
acceptable?

I think it makes sense to do this part independently, but we should be
careful to design this with the follow-up tasks in mind.

So I measured COPY TO time with/without this change. You can
see there is no significant loss of performance.

Data: Random 32 bit integers:

CREATE TABLE data (int32 integer);
INSERT INTO data
SELECT random() * 10000
FROM generate_series(1, ${n_records});

Seems encouraging. I assume the performance concerns stem from the use of
function pointers. Or was there something else?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
