Importing a Large .ndjson file

Started by Sankar P · almost 6 years ago · 7 messages · general
#1 Sankar P
sankar.curiosity@gmail.com

Hi

I have a .ndjson file. It is a new-line-delimited JSON file. It is
about 10GB and has about 100,000 records.

Some sample records:
```
{ "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": {
"k111": "v111" } } \n\r
{ "key21": "value21", "key22": [ "value22.1", "value22.2"] }
```
Now I want to INSERT these JSON records into my postgres table with the
following schema:

```
CREATE TABLE myTable (id BIGSERIAL, content JSONB);
```

The records should be inserted into the `content` field of my postgres
table.

What is the best way to do this on a postgresql database, deployed in
kubernetes, with 1 GB of RAM allocated?

I can probably write a script that would read this file line-by-line
and INSERT each record into the database, in a single transaction. But
I believe that would generate a lot of network traffic, and I want to
know if there is a better way to do this.
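
For illustration, the kind of SQL such a script would emit is roughly
the following, one round trip per record (values abbreviated):

```
-- Naive approach: one INSERT (and one network round trip) per record,
-- all wrapped in a single transaction.
BEGIN;
INSERT INTO myTable (content) VALUES ('{"key11": "value11"}');
INSERT INTO myTable (content) VALUES ('{"key21": "value21"}');
-- ... ~100,000 more INSERTs generated by the script ...
COMMIT;
```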

Thanks.

--
Sankar P
http://psankar.blogspot.com

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sankar P (#1)
Re: Importing a Large .ndjson file

Sankar P <sankar.curiosity@gmail.com> writes:

> I have a .ndjson file. It is a new-line-delimited JSON file. It is
> about 10GB and has about 100,000 records.
> Some sample records:
> { "key11": "value11", "key12": [ "value12.1", "value12.2"], "key13": {
> "k111": "v111" } } \n\r
> { "key21": "value21", "key22": [ "value22.1", "value22.2"] }
>
> What is the best way to do this on a postgresql database, deployed in
> kubernetes, with 1 GB of RAM allocated?

It looks like plain old COPY would do this just fine, along the lines
of (in psql)

\copy myTable(content) from 'myfile.ndjson'

If the newlines actually are \n\r rather than the more usual \r\n,
you might have to clean that up to stop COPY from thinking they
represent two line endings not one.

I'd advise extracting the first hundred or so lines of the file and doing
a test import into a temporary table, just to verify the process.
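
Something along these lines, say (the file names here are only for
illustration):

```
-- Grab a sample first, e.g.:  head -n 100 myfile.ndjson > sample.ndjson
-- If the lines really do carry stray \r characters, strip them first:
--   tr -d '\r' < myfile.ndjson > myfile-clean.ndjson
CREATE TEMP TABLE mytable_test (id BIGSERIAL, content JSONB);
\copy mytable_test(content) from 'sample.ndjson'
-- Spot-check that the JSON came through intact:
SELECT id, content FROM mytable_test LIMIT 5;
```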

regards, tom lane

#3 Sankar P
sankar.curiosity@gmail.com
In reply to: Tom Lane (#2)
Re: Importing a Large .ndjson file

> It looks like plain old COPY would do this just fine, along the lines
> of (in psql)
>
> \copy myTable(content) from 'myfile.ndjson'

Indeed. Thanks.

--
Sankar P
http://psankar.blogspot.com

#4 Sankar P
sankar.curiosity@gmail.com
In reply to: Sankar P (#3)
Re: Importing a Large .ndjson file

> It looks like plain old COPY would do this just fine, along the lines
> of (in psql)
>
> \copy myTable(content) from 'myfile.ndjson'

I spoke too soon. While this worked fine when there were no indexes
and finished within 10 minutes, with a GIN index on the jsonb column it
is taking hours and still has not completed.

--
Sankar P
http://psankar.blogspot.com

#5 Michael Lewis
mlewis@entrata.com
In reply to: Sankar P (#4)
Re: Importing a Large .ndjson file

> I spoke too soon. While this worked fine when there were no indexes
> and finished within 10 minutes, with a GIN index on the jsonb column it
> is taking hours and still has not completed.

It is always recommended to create indexes AFTER loading data. Sometimes it
can be faster to drop all indexes on the table, load huge data, and
re-create the indexes but there's no hard & fast rule. If you are adding
100k records to an empty or near empty table, I would remove all indexes
and create them after. Be sure you have sufficient maintenance_work_mem
also.
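
In psql, that could look something like the following (the index name
is hypothetical, so substitute the real one):

```
-- Drop the expensive GIN index before the bulk load...
DROP INDEX IF EXISTS mytable_content_gin;

-- ...give the rebuild a generous memory budget (with only 1 GB in the
-- pod, something considerably smaller may be all you can afford)...
SET maintenance_work_mem = '256MB';

-- ...load the data, then build the index once at the end.
\copy myTable(content) from 'myfile.ndjson'
CREATE INDEX mytable_content_gin ON myTable USING GIN (content);
```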

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Sankar P (#4)
Re: Importing a Large .ndjson file

Sankar P <sankar.curiosity@gmail.com> writes:

> I spoke too soon. While this worked fine when there were no indexes
> and finished within 10 minutes, with a GIN index on the jsonb column it
> is taking hours and still has not completed.

There's the generic advice that building an index after-the-fact
is often cheaper than updating it incrementally. For GIN indexes
specifically, see also

https://www.postgresql.org/docs/current/gin-tips.html
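
For instance, if the index has to stay in place while new rows arrive,
the pending-list machinery described there can be tuned instead (index
name hypothetical again):

```
-- Buffer new entries in GIN's pending list rather than updating the
-- main index structure for every inserted row:
ALTER INDEX mytable_content_gin SET (fastupdate = on);
-- Let the pending list grow larger before it is flushed (default 4MB):
SET gin_pending_list_limit = '64MB';
```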

regards, tom lane

#7 Sankar P
sankar.curiosity@gmail.com
In reply to: Tom Lane (#6)
Re: Importing a Large .ndjson file

> Sankar P <sankar.curiosity@gmail.com> writes:
>
>> I spoke too soon. While this worked fine when there were no indexes
>> and finished within 10 minutes, with a GIN index on the jsonb column it
>> is taking hours and still has not completed.
>
> There's the generic advice that building an index after-the-fact
> is often cheaper than updating it incrementally. For GIN indexes
> specifically, see also
>
> https://www.postgresql.org/docs/current/gin-tips.html

Thanks guys.

--
Sankar P
http://psankar.blogspot.com