Load a CSV or an Avro file?
Hello all,
It's a Postgres database. We have the option of receiving files from another
system in CSV and/or Avro format to load into our Postgres database. The
volume will be ~300 million messages per day, across many files, in batches.
My question is: which format should we choose for faster data loading
performance? And are there any other aspects, apart from loading
performance, that should be considered?
Hi
There are different data formats available; here are a few points on their
performance implications:
1. CSV: easy to use and widely supported, but it can be slower due to
parsing overhead.
2. Binary: faster to load, but not human-readable.
Hope this helps.
Regards
Kashif Zeeshan
On Fri, Jul 5, 2024 at 11:08 AM, sud <suds1434@gmail.com> wrote:
We are able to load ~300 million rows per day using CSV and the COPY
functions (https://www.postgresql.org/docs/current/libpq-copy.html#LIBPQ-COPY-SEND).
Hi,
Performance Considerations
Avro files are smaller due to compression, so they need less I/O time,
whereas CSV files are simpler but larger, so reads and writes take longer.
The COPY command works very well with CSV files, whereas an ETL step is
required to handle Avro.
Regards,
Muhammad Ikram
On Fri, Jul 5, 2024 at 5:08 AM sud <suds1434@gmail.com> wrote:
What application will be loading the data? If psql, then go with CSV;
COPY is *really* efficient.
If the PG tables are already mapped to the avro format, then maybe avro
will be faster.
and if any other aspects to it also should be considered apart from just
loading performance?
If all the data comes in at night, drop as many indices as possible before
loading.
Load each file in as few DB connections as possible: the most efficient
binary format won't do you any good if you open and close a connection for
each and every row.
On 7/5/24 02:08, sud wrote:
Are you dumping the entire contents of each file, or are you pulling a
portion of the data out?
--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Jul 5, 2024 at 3:27 PM Kashif Zeeshan <kashi.zeeshan@gmail.com>
wrote:
My understanding was that it will be faster to load the .csv, since it
already maps to table rows and columns, whereas with .avro the fields have
to be mapped to the table's columns, which adds overhead. Is my
understanding correct?
On Fri, Jul 5, 2024 at 8:24 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
Yes, all the fields in the file have to be loaded into columns of the
tables in Postgres. But how does that matter for deciding whether we should
ask the outside system for the data in .csv or .avro format? Again, my
understanding was that, regardless of anything else, the .csv load will
always be faster, because the data is already stored in row-and-column
form, whereas with .avro the parser has to do extra work to map it to the
table's columns. Is my understanding correct?
On Sat, Jul 6, 2024 at 4:10 PM sud <suds1434@gmail.com> wrote:
But you didn't say *which* columns or *which* tables.
If one row of CSV input must be split into multiple tables, then it might
be pretty slow.
Yes and no. It all depends on how well each input row maps to a PostgreSQL
table.
Bottom line: you want an absolute answer, but we can't give you one, since
we don't know what the input data looks like, and we don't know what the
PostgreSQL tables look like.
An AVRO file *might* be faster to input than CSV, or it might be horribly
slower.
And you might incompetently program a CSV importer so that it's horribly
slow.
We can't give absolute answers without knowing more details than the
ambiguous generalities in your emails.
On 7/6/24 13:09, sud wrote:
If you are going to use complete rows, and all rows, then COPY of CSV into
Postgres would be your best choice.
--
Adrian Klaver
adrian.klaver@aklaver.com