COPY TO STDOUT Apache Arrow support
Hi,
would it be possible to add the Apache Arrow streaming format to the COPY
backend + frontend?
The use case is fetching (or storing) tens or hundreds of millions of rows
for client-side data science purposes (Pandas, Apache Arrow compute
kernels, Parquet conversion, etc.). It looks like the serialization overhead
when using the PostgreSQL wire format can be significant.
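The overhead has a simple structural source: PostgreSQL's binary COPY format frames every row individually, while Arrow ships each column as one contiguous buffer. A minimal stdlib sketch of the arithmetic (an illustration of the framing only, not the real COPY code path):

```python
import struct

# One int32 column, row-oriented vs columnar.
n_rows = 10_000_000

# PostgreSQL binary COPY tuple: int16 field count + int32 field length
# + the 4-byte int32 value itself = 10 bytes per row on the wire.
pg_row = struct.pack("!hii", 1, 4, 42)
pg_bytes = len(pg_row) * n_rows

# Arrow streaming format sends the column as one contiguous int32 buffer
# (plus a small fixed per-batch header, ignored here).
arrow_bytes = 4 * n_rows

print(len(pg_row))             # 10
print(pg_bytes / arrow_bytes)  # 2.5
```

On top of the raw byte count, the per-row framing also costs a parse step per field, which Arrow's fixed-width buffers avoid entirely.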
Best regards,
Adam Lippai
Hi,
There have been two bigger developments on this topic:
1. Pandas 2.0 has been released, and it can use Apache Arrow as a backend.
2. Apache Arrow ADBC has been released, which standardizes the client API.
Currently it uses the PostgreSQL wire protocol underneath.
Best regards,
Adam Lippai
On Thu, Apr 21, 2022 at 10:41 AM Adam Lippai <adam@rigo.sk> wrote:
Hi,
There is also a new Arrow C library, nanoarrow (one .h and one .c file),
which makes it easier to use Arrow from the PostgreSQL codebase.
https://arrow.apache.org/blog/2023/03/07/nanoarrow-0.1.0-release/
https://github.com/apache/arrow-nanoarrow/tree/main/dist
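For anyone evaluating the format itself, the streaming format's outer framing is simple. Here is a stdlib sketch of the Arrow IPC encapsulated-message layout (a simplification: the metadata below is a placeholder, not a real Flatbuffers payload):

```python
import struct

# Each Arrow IPC stream message is framed as:
#   4-byte continuation marker 0xFFFFFFFF
#   4-byte little-endian metadata length (padded to an 8-byte multiple)
#   Flatbuffers metadata, then the message body buffers.
CONTINUATION = 0xFFFFFFFF

def frame_message(metadata: bytes, body: bytes) -> bytes:
    # Pad the metadata so the body starts on an 8-byte boundary.
    pad = (-len(metadata)) % 8
    meta = metadata + b"\x00" * pad
    return struct.pack("<II", CONTINUATION, len(meta)) + meta + body

def read_message(buf: bytes):
    marker, meta_len = struct.unpack_from("<II", buf, 0)
    assert marker == CONTINUATION
    return buf[8:8 + meta_len], buf[8 + meta_len:]

msg = frame_message(b"fake-flatbuffer", b"column-data")
meta, body = read_message(msg)
print(len(meta), body)   # 16 b'column-data'
```

A COPY integration would emit a schema message first, then one such framed record batch per chunk, which is what lets clients start consuming before the full result has been produced.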
Best regards,
Adam Lippai
On Thu, Apr 13, 2023 at 2:35 PM Adam Lippai <adam@rigo.sk> wrote:
Hi
On Wed, May 3, 2023 at 5:15 AM Adam Lippai <adam@rigo.sk> wrote:
With commit 9fcdf2c787ac6da330165ea3cd50ec5155943a2b it can be implemented
in an extension.
Regards
Pavel