Make COPY extendable in order to support Parquet and other formats

Started by Aleksander Alekseevover 3 years ago11 messages
#1Aleksander Alekseev
aleksander@timescale.com

Hi hackers,

In several conversations I had recently with colleagues it was pointed
out that it would be great if PostgreSQL supported COPY to/from
Parquet and other formats. I've found a corresponding discussion [1]/messages/by-id/20180210151304.fonjztsynewldfba@gmail.com
on pgsql-general@. The consensus reached back in 2018 seems to be that
this shouldn't be implemented in the core but rather an API should be
provided for the extensions. To my knowledge this was never
implemented though.

I would like to invest some time into providing a corresponding patch
for the core and implementing "pg_copy_parquet" extension as a
practical example, and yet another, a bit simpler, extension as an API
usage example for the core codebase. I just wanted to double-check
that this is still a wanted feature and no one on pgsql-hackers@
objects the idea.

Any feedback, suggestions and ideas are most welcome.

[1]: /messages/by-id/20180210151304.fonjztsynewldfba@gmail.com

--
Best regards,
Aleksander Alekseev

#2Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Aleksander Alekseev (#1)
Re: Make COPY extendable in order to support Parquet and other formats

On Mon, Jun 20, 2022 at 8:35 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:

I would like to invest some time into providing a corresponding patch
for the core and implementing "pg_copy_parquet" extension as a
practical example, and yet another, a bit simpler, extension as an API
usage example for the core codebase. I just wanted to double-check
that this is still a wanted feature and no one on pgsql-hackers@
objects the idea.

An extension just for COPY to/from parquet looks limited in
functionality. Shouldn't this be viewed as an FDW or Table AM support
for parquet or other formats? Of course the later is much larger in
scope compared to the first one. But there may already be efforts
underway [1]
https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

I have not used it myself or worked with it.

--
Best Wishes,
Ashutosh Bapat

#3Aleksander Alekseev
aleksander@timescale.com
In reply to: Ashutosh Bapat (#2)
Re: Make COPY extendable in order to support Parquet and other formats

Hi Ashutosh,

An extension just for COPY to/from parquet looks limited in
functionality. Shouldn't this be viewed as an FDW or Table AM support
for parquet or other formats? Of course the later is much larger in
scope compared to the first one. But there may already be efforts
underway
https://www.postgresql.org/about/news/parquet-s3-fdw-01-was-newly-released-2179/

Many thanks for sharing your thoughts on this!

We are using parquet_fdw [2]https://github.com/adjust/parquet_fdw but this is a read-only FDW.

What users typically need is to dump their data as fast as possible in
a given format and either to upload it to the cloud as historical data
or to transfer it to another system (Spark, etc). The data can be
accessed later if needed, as read only one.

Note that when accessing the historical data with parquet_fdw you
basically have a zero ingestion time.

Another possible use case is transferring data to PostgreSQL from
another source. Here the requirements are similar - the data should be
dumped as fast as possible from the source, transferred over the
network and imported as fast as possible.

In other words, personally I'm unaware of use cases when somebody
needs a complete read/write FDW or TableAM implementation for formats
like Parquet, ORC, etc. Also to my knowledge they are not particularly
optimized for this.

[2]: https://github.com/adjust/parquet_fdw

--
Best regards,
Aleksander Alekseev

#4Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
In reply to: Aleksander Alekseev (#3)
Re: Make COPY extendable in order to support Parquet and other formats

On Tue, Jun 21, 2022 at 3:26 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:

In other words, personally I'm unaware of use cases when somebody
needs a complete read/write FDW or TableAM implementation for formats
like Parquet, ORC, etc. Also to my knowledge they are not particularly
optimized for this.

IIUC, you want extensibility in FORMAT argument to COPY command
https://www.postgresql.org/docs/current/sql-copy.html. Where the
format is pluggable. That seems useful.
Another option is to dump the data in csv format but use external
utility to convert csv to parquet or whatever other format is. I
understand that that's not going to be as efficient as dumping
directly in the desired format.

--
Best Wishes,
Ashutosh Bapat

#5Aleksander Alekseev
aleksander@timescale.com
In reply to: Ashutosh Bapat (#4)
Re: Make COPY extendable in order to support Parquet and other formats

Hi Ashutosh,

IIUC, you want extensibility in FORMAT argument to COPY command
https://www.postgresql.org/docs/current/sql-copy.html. Where the
format is pluggable. That seems useful.
Another option is to dump the data in csv format but use external
utility to convert csv to parquet or whatever other format is. I
understand that that's not going to be as efficient as dumping
directly in the desired format.

Exactly. However, to clarify, I suspect this may be a bit more
involved than simply extending the FORMAT arguments.

This change per se will not be extremely useful. Currently nothing
prevents an extension author to iterate over a table using
heap_open(), heap_getnext(), etc API and dump its content in any
format. The user will have to write "dump_table(foo, filename)"
instead of "COPY ..." but that's not a big deal.

The problem is that every new extension has to re-invent things like
figuring out the schema, the validation of the data, etc. If we could
do this in the core so that an extension author has to implement only
the minimal format-dependent list of callbacks that would be really
great. In order to make the interface practical though one will have
to implement a practical extension as well, for instance, a Parquet
one.

This being said, if it turns out that for some reason this is not
realistic to deliver, ending up with simply extending this part of the
syntax a bit should be fine too.

--
Best regards,
Aleksander Alekseev

#6Andres Freund
andres@anarazel.de
In reply to: Ashutosh Bapat (#4)
Re: Make COPY extendable in order to support Parquet and other formats

Hi,

On 2022-06-22 16:59:16 +0530, Ashutosh Bapat wrote:

On Tue, Jun 21, 2022 at 3:26 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:

In other words, personally I'm unaware of use cases when somebody
needs a complete read/write FDW or TableAM implementation for formats
like Parquet, ORC, etc. Also to my knowledge they are not particularly
optimized for this.

IIUC, you want extensibility in FORMAT argument to COPY command
https://www.postgresql.org/docs/current/sql-copy.html. Where the
format is pluggable. That seems useful.

Agreed.

But I think it needs quite a bit of care. Just plugging in a bunch of per-row
(or worse, per field) switches to COPYs input / output parsing will make the
code even harder to read and even slower.

I suspect that we'd first need a patch to refactor the existing copy code a
good bit to clean things up. After that it hopefully will be possible to plug
in a new format without being too intrusive.

I know little about parquet - can it support FROM STDIN efficiently?

Greetings,

Andres Freund

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#6)
Re: Make COPY extendable in order to support Parquet and other formats

Andres Freund <andres@anarazel.de> writes:

On 2022-06-22 16:59:16 +0530, Ashutosh Bapat wrote:

IIUC, you want extensibility in FORMAT argument to COPY command
https://www.postgresql.org/docs/current/sql-copy.html. Where the
format is pluggable. That seems useful.

Agreed.

Ditto.

I suspect that we'd first need a patch to refactor the existing copy code a
good bit to clean things up. After that it hopefully will be possible to plug
in a new format without being too intrusive.

I think that step 1 ought to be to convert the existing formats into
plug-ins, and demonstrate that there's no significant loss of performance.

regards, tom lane

#8Aleksander Alekseev
aleksander@timescale.com
In reply to: Tom Lane (#7)
Re: Make COPY extendable in order to support Parquet and other formats

Andres, Tom,

I suspect that we'd first need a patch to refactor the existing copy code a
good bit to clean things up. After that it hopefully will be possible to plug
in a new format without being too intrusive.

I think that step 1 ought to be to convert the existing formats into
plug-ins, and demonstrate that there's no significant loss of performance.

Yep, this looks like a promising strategy to me too.

I know little about parquet - can it support FROM STDIN efficiently?

Parquet is a compressed binary format with data grouped by columns
[1]: https://parquet.apache.org/docs/file-format/
particular format.

[1]: https://parquet.apache.org/docs/file-format/

--
Best regards,
Aleksander Alekseev

#9Andres Freund
andres@anarazel.de
In reply to: Aleksander Alekseev (#8)
Re: Make COPY extendable in order to support Parquet and other formats

Hi,

On 2022-06-23 11:38:29 +0300, Aleksander Alekseev wrote:

I know little about parquet - can it support FROM STDIN efficiently?

Parquet is a compressed binary format with data grouped by columns
[1]. I wouldn't assume that this is a primary use case for this
particular format.

IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
can't do COPY from/to a client. Which would make the feature unusable for
anybody not superuser, including just about all users of hosted PG.

Greetings,

Andres Freund

#10Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#9)
Re: Make COPY extendable in order to support Parquet and other formats

On 2022-06-23 Th 21:45, Andres Freund wrote:

Hi,

On 2022-06-23 11:38:29 +0300, Aleksander Alekseev wrote:

I know little about parquet - can it support FROM STDIN efficiently?

Parquet is a compressed binary format with data grouped by columns
[1]. I wouldn't assume that this is a primary use case for this
particular format.

IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
can't do COPY from/to a client. Which would make the feature unusable for
anybody not superuser, including just about all users of hosted PG.

+1

Note that Parquet puts the metadata at the end of each file, which makes
it nice to write but somewhat unfriendly for streaming readers, which
would have to accumulate the whole file in order to process it.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#11Aleksander Alekseev
aleksander@timescale.com
In reply to: Andrew Dunstan (#10)
Re: Make COPY extendable in order to support Parquet and other formats

Hi Andrew,

IMO decent COPY FROM / TO STDIN support is crucial, because otherwise you
can't do COPY from/to a client. Which would make the feature unusable for
anybody not superuser, including just about all users of hosted PG.

+1

Note that Parquet puts the metadata at the end of each file, which makes
it nice to write but somewhat unfriendly for streaming readers, which
would have to accumulate the whole file in order to process it.

It's not necessarily that bad since data is divided into pages, each
page can be processed separately. However personally I have limited
experience with Parquet at this point. Some experimentation is
required. I will keep in mind the requirement regarding COPY FROM / TO
STDIN.

--
Best regards,
Aleksander Alekseev