Loading 500m json files to database

Started by pinker · about 6 years ago · 24 messages · general
#1 pinker
pinker@onet.eu

Hi, do you maybe have an idea how to make the loading process faster?

I have 500 million JSON files (1 JSON document per file) that I need to load
into a db.
My test set is "only" 1 million files.

What I have come up with so far is:

time for i in datafiles/*; do
psql -c "\copy json_parts(json_data) FROM $i"&
done

which is the fastest so far, but it's not what I expected. Loading 1m files
takes ~3h, so loading 500 times more is just unacceptable.

some facts:
* the target db is in the cloud, so there is no option for tricks like
turning fsync off
* version: PostgreSQL 11
* I can spin up a huge Postgres instance if necessary in terms of CPU/RAM
* I have already tried hash partitioning (writing to 10 different tables
instead of 1)

Any ideas?


#2 Ertan Küçükoğlu
ertan.kucukoglu@1nar.com.tr
In reply to: pinker (#1)
Re: Loading 500m json files to database

Hello,

I may not be knowledgeable enough to answer your question.

However, if possible, you may consider using a local physical computer to
do all the loading and afterwards do a backup/restore onto the cloud system.

A compressed backup will mean far less internet traffic compared to direct
data inserts.

Moreover, locally you can use the additional tricks you mentioned.

Thanks & regards,
Ertan

#3 Andrei Zhidenkov
andrei.zhidenkov@n26.com
In reply to: Ertan Küçükoğlu (#2)
Re: Loading 500m json files to database

Try writing a stored procedure (probably PL/Python) that accepts an array of JSON objects, so that data can be loaded in chunks (100-1000 files at a time), which should be faster.
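
A minimal sketch of that idea, assuming plpython3u can be installed (later in the thread it turns out not to be available on this managed instance); the function name and chunk handling are illustrative only:

-- Hypothetical chunk loader; "docs" carries 100-1000 JSON documents per call,
-- so per-statement overhead is paid once per chunk instead of once per file.
CREATE OR REPLACE FUNCTION load_json_chunk(docs text[]) RETURNS integer
LANGUAGE plpython3u AS $$
# Prepare the insert once, then reuse it for every document in the chunk.
plan = plpy.prepare(
    "INSERT INTO json_parts (json_data) VALUES ($1)", ["jsonb"])
for doc in docs:
    plpy.execute(plan, [doc])
return len(docs)
$$;

-- Example call from a client, passing one chunk:
-- SELECT load_json_chunk(ARRAY['{"a": 1}', '{"b": 2}']);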


#4 Rob Sargent
robjsargent@gmail.com
In reply to: Andrei Zhidenkov (#3)
Re: Loading 500m json files to database

Drop any and all indices

#5 Chris Browne
cbbrowne@acm.org
In reply to: pinker (#1)
Re: Loading 500m json files to database

Well, you're paying for a lot of overhead in that, as you're establishing a
psql command, connecting to a database, spawning a backend process, starting
a transaction, committing a transaction, closing the backend process,
disconnecting from the database, and cleaning up after the launching of the
psql command. And you're doing that 500 million times.

The one thing I left off that list was the loading of a single tuple into
json_parts.

What you could do to improve things quite a lot would be to group some number
of those files together, so that each time you pay for the overhead, you at
least get the benefit of loading several entries into json_parts.

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several
files together into one, and then \copy that one file in.

Having 2 tuples loaded at once drops overhead by 50%
Having 10 tuples loaded at once drops overhead by 90%
Having 100 tuples loaded at once drops overhead by 99%
Having 1000 tuples loaded at once drops overhead by 99.9%

There probably isn't too much real value in going past 1000 tuples per batch;
the overhead, by that point, is getting pretty immaterial.

Reducing that overhead is the single most important thing you can do.

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of
your hardware to know if that would do any good.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

#6 Rob Sargent
robjsargent@gmail.com
In reply to: pinker (#1)
Re: Loading 500m json files to database

Most advanced languages have a bulk copy implementation. I've found this to be blindingly fast when the receiving table has no indices or constraints. It's not clear how large your files are, but you might take this time to "normalize" them: extract any id, datatype, etc. into table attributes.


#7 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: pinker (#1)
Re: Loading 500m json files to database


Aggregating the JSON files as others have suggested would help greatly.

Knowing what is happening in json_parts() might help folks provide
further tips.


--
Adrian Klaver
adrian.klaver@aklaver.com

#8 pinker
pinker@onet.eu
In reply to: Chris Browne (#5)
Re: Loading 500m json files to database

Christopher Browne-3 wrote

So, loosely, I'd recommend using /bin/cat (or similar) to assemble several
files together into one, and then \copy that one file in.

[...]

Reducing that overhead is the single most important thing you can do.

Yes, I was thinking about that, but I have no idea how to do it right now.
Some kind of outer loop to concatenate those files? And adding a delimiter
between them?

Christopher Browne-3 wrote

It is also quite likely that you could run such streams in parallel, although
it would require quite a bit more information about the I/O capabilities of
your hardware to know if that would do any good.

I can spin up an instance of any size.


#9 pinker
pinker@onet.eu
In reply to: Ertan Küçükoğlu (#2)
Re: Loading 500m json files to database

Ertan Küçükoğlu wrote

However, if possible, you may consider using a local physical computer to
do all the loading and afterwards do a backup/restore onto the cloud system.

A compressed backup will mean far less internet traffic compared to direct
data inserts.

I was thinking about that, but the data source is blob storage, so downloading
it first and then loading it locally means a couple of extra days of
processing :/ It's not that fast even when I'm doing it locally ... so that
would be like two extra steps of overhead :/


#10 pinker
pinker@onet.eu
In reply to: Andrei Zhidenkov (#3)
Re: Loading 500m json files to database

it's a cloud instance and no plpythonu extension is available, unfortunately


#11 pinker
pinker@onet.eu
In reply to: Rob Sargent (#4)
Re: Loading 500m json files to database

there are no indexes, foreign keys, or any other constraints


#12 pinker
pinker@onet.eu
In reply to: Adrian Klaver (#7)
Re: Loading 500m json files to database

Hi,
json_parts is just a single table with 2 columns:

Table "public.json_parts"
Column | Type | Collation | Nullable | Default
| Storage | Stats target | Description
-----------+---------+-----------+----------+----------------------------------------+----------+--------------+-------------
id | integer | | not null |
nextval('json_parts_id_seq'::regclass) | plain | |
json_data | jsonb | | |

no indexes, constraints or anything else


#13 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: pinker (#12)
Re: Loading 500m json files to database

On 3/23/20 5:26 PM, pinker wrote:

Hi,
json_parts is just a single table with 2 columns:

Well I misread that.

Table "public.json_parts"
Column | Type | Collation | Nullable | Default
| Storage | Stats target | Description
-----------+---------+-----------+----------+----------------------------------------+----------+--------------+-------------
id | integer | | not null |
nextval('json_parts_id_seq'::regclass) | plain | |
json_data | jsonb | | |

no indexes, constraints or anything else

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-general-f1843780.html

--
Adrian Klaver
adrian.klaver@aklaver.com

#14 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: pinker (#10)
Re: Loading 500m json files to database

On 3/23/20 5:23 PM, pinker wrote:

it's a cloud instance and no plpythonu extension is available, unfortunately

I presume Python itself is available, so would it not be possible to
create a program that concatenates the files into batches and COPYs
that data into Postgres using the psycopg2 COPY functions:

https://www.psycopg.org/docs/usage.html#using-copy-to-and-copy-from
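
A minimal sketch of that approach, assuming psycopg2 and local copies of the files; the connection string, batch size, and escaping helper are illustrative, and the table/column names follow the thread:

import glob
import io

import psycopg2

BATCH = 1000  # files per COPY round trip

def copy_escape(doc):
    # COPY text format needs backslash, tab, newline and carriage return escaped
    return (doc.replace('\\', '\\\\').replace('\t', '\\t')
               .replace('\n', '\\n').replace('\r', '\\r'))

conn = psycopg2.connect("dbname=target")  # placeholder connection string
files = sorted(glob.glob('datafiles/*'))
with conn:
    with conn.cursor() as cur:
        for start in range(0, len(files), BATCH):
            buf = io.StringIO()
            for path in files[start:start + BATCH]:
                with open(path, encoding='utf-8') as f:
                    buf.write(copy_escape(f.read()) + '\n')
            buf.seek(0)
            cur.copy_expert("COPY json_parts (json_data) FROM STDIN", buf)
conn.close()

copy_expert streams each batch over a single connection, so the per-file psql
startup and connection cost from the original loop disappears.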


--
Adrian Klaver
adrian.klaver@aklaver.com

#15 David G. Johnston
david.g.johnston@gmail.com
In reply to: pinker (#1)
Re: Loading 500m json files to database

On Mon, Mar 23, 2020 at 3:24 AM pinker <pinker@onet.eu> wrote:

time for i in datafiles/*; do
psql -c "\copy json_parts(json_data) FROM $i"&
done

Don't know whether this is faster but it does avoid spinning up a
connection multiple times.

#bash, linux
function append_each_split_file_to_etl_load_script() {
  for filetoload in ./*; do
    ronumber="$(basename $filetoload)"
    # only process files since subdirs can be present
    if [[ -f "$filetoload" ]]; then
      echo ""
      echo "\set invoice"' `cat '"'""$filetoload""'"'`'
      echo ", ('$ronumber',:'invoice')"
    fi >> "$PSQLSCRIPT"
  done

  echo "" >> "$PSQLSCRIPT"
  echo ";" >> "$PSQLSCRIPT"
  echo "" >> "$PSQLSCRIPT"
}

There is a bit of other related code that is needed (for my specific usage)
but this is the core of it. Use psql variables to capture the contents of
each file into a variable and then just perform a normal insert
(specifically, a VALUES (...), (...) variant). Since you can intermix psql
and SQL you basically output a bloody long script that has memory issues
at scale - but you can divide and conquer - and then run "psql --file
bloody_long_script_part_1_of_100000.psql".

David J.

#16 Rob Sargent
robjsargent@gmail.com
In reply to: David G. Johnston (#15)
Re: Loading 500m json files to database

Can one put 550M files in a single directory? I thought it topped out at 16M or so.

#17 pinker
pinker@onet.eu
In reply to: Rob Sargent (#16)
Re: Loading 500m json files to database

it's in blob storage in Azure. I'm testing with the 1m files that I have locally


#18 pinker
pinker@onet.eu
In reply to: Adrian Klaver (#14)
Re: Loading 500m json files to database

Hmm, now I'm thinking: maybe setting up PgBouncer in front of Postgres in
statement pooling mode would help?


#19 Reid Thompson
Reid.Thompson@omnicell.com
In reply to: pinker (#1)
Re: Loading 500m json files to database

https://www.gnu.org/software/parallel/

#20 Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: pinker (#8)
Re: Loading 500m json files to database

On 2020-03-23 17:18:45 -0700, pinker wrote:

Yes, I was thinking about that, but I have no idea how to do it right now.

Do you know any programming language (Python, Perl, ...)? You'll
probably get a huge gain from writing a script that just opens the
connection once and then inserts each file.

Copy usually is even faster by a fair amount, but since you have to read
the data for each row from a different file (and - if I understood you
correctly, a remote one at that), the additional speedup is probably not
that great in this case.

Splitting the work into batches and executing several batches in parallel
probably helps.
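
A minimal sketch of both points, assuming psycopg2 and locally readable files: each worker keeps one connection open for all of its inserts and commits in batches. The connection string, worker count, and batch size are placeholders that would need tuning against the instance's I/O capabilities.

import glob
from multiprocessing import Pool

import psycopg2

DSN = "dbname=target"   # placeholder connection string
WORKERS = 8             # illustrative; tune to the server's capabilities
COMMIT_EVERY = 1000     # files per transaction

def load_files(paths):
    # One connection per worker, reused for every file in its share.
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    for i, path in enumerate(paths, 1):
        with open(path, encoding='utf-8') as f:
            cur.execute("INSERT INTO json_parts (json_data) VALUES (%s)",
                        (f.read(),))
        if i % COMMIT_EVERY == 0:
            conn.commit()
    conn.commit()
    conn.close()

if __name__ == '__main__':
    files = sorted(glob.glob('datafiles/*'))
    shares = [files[i::WORKERS] for i in range(WORKERS)]
    with Pool(WORKERS) as pool:
        pool.map(load_files, shares)

Replacing the per-row INSERT with a COPY per batch, as sketched earlier in the
thread, would combine both gains.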

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

#21 Kevin Brannen
KBrannen@efji.com
In reply to: pinker (#10)
#22 Rob Sargent
robjsargent@gmail.com
In reply to: Kevin Brannen (#21)
#23 Kevin Brannen
KBrannen@efji.com
In reply to: Rob Sargent (#22)
#24 Rob Sargent
robjsargent@gmail.com
In reply to: Kevin Brannen (#23)