How to import Apache parquet files?

Started by Softwarelimits, over 6 years ago · 5 messages · general
#1 Softwarelimits
softwarelimits@gmail.com

Hi, I need to come and ask here: I did not find enough information, so I
hope I am just having a bad day or somebody is censoring my search results
for fun... :)

I would like to import (lots of) Apache parquet files to a PostgreSQL 11
cluster - yes, I believe it should be done with the Python pyarrow module,
but before digging into the possible traps I would like to ask here if
there is some common, well understood and documented tool that may be
helpful with that process?

It seems that the COPY command can import binary data, but I am not able to
allocate enough resources to understand how to implement a parquet file
import with that.

I would really like to follow a person with much more knowledge than me
about either PostgreSQL or the Apache Parquet format instead of reinventing
the wheel.

Any hints very welcome,
thank you very much for your attention!
John

#2 Imre Samu
pella.samu@gmail.com
In reply to: Softwarelimits (#1)
Re: How to import Apache parquet files?

I would like to import (lots of) Apache parquet files to a PostgreSQL 11 cluster

imho: you should check and test parquet_fdw (the Parquet Foreign Data Wrapper):
- https://github.com/adjust/parquet_fdw
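
If parquet_fdw works for you, the setup is roughly the following sketch,
based on the extension's README; the server name, table definition, and
file path are hypothetical and the column list must match your parquet
schema:

```sql
CREATE EXTENSION parquet_fdw;
CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;
CREATE USER MAPPING FOR CURRENT_USER SERVER parquet_srv;

-- expose one parquet file as a foreign table
CREATE FOREIGN TABLE measurements (
    ts    timestamp,
    value double precision
)
SERVER parquet_srv
OPTIONS (filename '/path/to/file.parquet');
```

After that, the file can be queried with plain SQL, or copied into a
regular table with INSERT INTO ... SELECT if a full import is needed.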

Imre


#3 Softwarelimits
softwarelimits@gmail.com
In reply to: Imre Samu (#2)
Re: How to import Apache parquet files?

Hi Imre, thanks for the quick response - yes, I found that, but I was not
sure if it is already production-ready. Also, I would like to use the data
with the TimescaleDB extension, which is why I need a full import.

Have nice day!


#4 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Softwarelimits (#3)
Re: How to import Apache parquet files?

On Tue, Nov 05, 2019 at 04:21:45PM +0100, Softwarelimits wrote:

Hi Imre, thanks for the quick response - yes, I found that, but I was not
sure if it is already production ready - also I would like to use the data
with the timescale extension, that is why I need a full import.

Well, we're not in the position to decide if parquet_fdw is production
ready; that's something you need to ask the author of the extension (and
then also judge yourself).

That being said, I think FDW is probably the best way to do this. It's
explicitly designed to work with foreign data, so using it to access
parquet files seems somewhat natural.

The alternative is probably transforming the data into COPY format, and
then load it into Postgres using COPY (either as a file, or stdin).

Which of these options is the right one depends on your requirements.
FDW is more convenient, but row-based and probably significantly less
efficient than COPY. So if you have a lot of these parquet files, I'd
probably use COPY. But maybe the ability to query the parquet files
directly (with FDW) is useful for you.
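
To illustrate the COPY route, here is a minimal Python sketch (assuming
the pyarrow and psycopg2 packages; the table name, column types, and file
path are hypothetical): each parquet file is converted to CSV in memory
and streamed through COPY ... FROM STDIN.

```python
# Sketch: load a parquet file into Postgres via COPY.
# Assumes pyarrow and psycopg2 are installed; names are illustrative.
import io

def copy_statement(table, columns):
    """Build a COPY ... FROM STDIN command for the given table and columns."""
    cols = ", ".join(columns)
    return f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv)"

def load_parquet(path, table, conn):
    """Stream one parquet file into an existing Postgres table."""
    import pyarrow.parquet as pq
    import pyarrow.csv as pacsv

    tbl = pq.read_table(path)                 # read the parquet file
    buf = io.BytesIO()
    # write the table as headerless CSV into an in-memory buffer
    pacsv.write_csv(tbl, buf, pacsv.WriteOptions(include_header=False))
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(copy_statement(table, tbl.column_names), buf)
    conn.commit()
```

For very large files it would be better to stream batch by batch (e.g.
with pq.ParquetFile.iter_batches) instead of materializing the whole file,
but the COPY mechanics stay the same.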

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Nicolas Paris
nicolas.paris@riseup.net
In reply to: Softwarelimits (#1)
Re: How to import Apache parquet files?

I would like to import (lots of) Apache parquet files to a PostgreSQL 11 cluster

you might be interested in the spark-postgres library. Basically the library
allows you to bulk load parquet files with one Spark command:

spark
  .read.format("parquet")
  .load(parquetFilesPath)              // read the parquet files
  .write.format("postgres")
  .option("host", "yourHost")
  .option("partitions", 4)             // 4 parallel threads
  .option("table", "theTable")
  .option("user", "theUser")
  .option("database", "thePgDatabase")
  .option("schema", "thePgSchema")
  .save()                              // bulk load into postgres

more details at https://github.com/EDS-APHP/spark-etl/tree/master/spark-postgres


--
nicolas