Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

Started by Shaozhong SHIover 4 years ago12 messagesgeneral
Jump to latest
#1Shaozhong SHI
shishaozhong@gmail.com

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/&gt;

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

Regards,

David

#2Shaozhong SHI
shishaozhong@gmail.com
In reply to: Shaozhong SHI (#1)

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/&gt;

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

Regards,

David

#3Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Shaozhong SHI (#2)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On 10/4/21 8:44 AM, Shaozhong SHI wrote:

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/&gt;

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

method is a legitimate keyword:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html?highlight=to_sql#pandas.DataFrame.to_sql

So there must be something about how you used it.

Pandas version?

The actual code you used when the error occurred?

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

#4Shaozhong SHI
shishaozhong@gmail.com
In reply to: Adrian Klaver (#3)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

Hello, Adrian Klaver,

Pandas version is 0.23.0.

I used the following code:

def psql_insert_copy(table, conn, keys, data_iter):
# gets a DBAPI connection that can provide a cursor
dbapi_conn = conn.connection
with dbapi_conn.cursor() as cur:
s_buf = StringIO()
writer = csv.writer(s_buf)
writer.writerows(data_iter)
s_buf.seek(0)

columns = ', '.join('"{}"'.format(k) for k in keys)
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name)
else:
table_name = table.name

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)

engine = create_engine('postgresql+psycopg2://:5432/postgres')
try:
df.to_sql('test1', engine, schema='public', if_exists='append',
index=False, method=psql_insert_copy)

I could not find obvious reasons.

Regards,

David

On Mon, 4 Oct 2021 at 17:06, Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

Show quoted text

On 10/4/21 8:44 AM, Shaozhong SHI wrote:

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<

https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

method is a legitimate keyword:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html?highlight=to_sql#pandas.DataFrame.to_sql

So there must be something about how you used it.

Pandas version?

The actual code you used when the error occurred?

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

#5Rob Sargent
robjsargent@gmail.com
In reply to: Shaozhong SHI (#4)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On Oct 4, 2021, at 10:20 AM, Shaozhong SHI <shishaozhong@gmail.com> wrote:

Hello, Adrian Klaver,

Pandas version is 0.23.0.

I used the following code:

def psql_insert_copy(table, conn, keys, data_iter):
# gets a DBAPI connection that can provide a cursor
dbapi_conn = conn.connection
with dbapi_conn.cursor() as cur:
s_buf = StringIO()
writer = csv.writer(s_buf)
writer.writerows(data_iter)
s_buf.seek(0)

columns = ', '.join('"{}"'.format(k) for k in keys)
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name <http://table.name/&gt;)
else:
table_name = table.name <http://table.name/&gt;

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)

engine = create_engine('postgresql+psycopg2://:5432/postgres')
try:
df.to_sql('test1', engine, schema='public', if_exists='append', index=False, method=psql_insert_copy)

you need to quote ‘psql_insert_copy'

#6Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Shaozhong SHI (#4)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On 10/4/21 9:20 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,

Pandas version is 0.23.0.

The reason the below does not work is method did not show up until
pandas 0.24.0.

I used the following code:

def psql_insert_copy(table, conn, keys, data_iter):
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name
<http://table.name&gt;)
        else:
            table_name = table.name <http://table.name&gt;

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)
engine = create_engine('postgresql+psycopg2://:5432/postgres')
try:
    df.to_sql('test1', engine, schema='public', if_exists='append',
index=False, method=psql_insert_copy)

I could not find obvious reasons.

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

#7Shaozhong SHI
shishaozhong@gmail.com
In reply to: Adrian Klaver (#6)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

Hello, Adrian Klaver,
What is the robust way to upgrade Pandas?
Regards,
David

On Monday, 4 October 2021, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Show quoted text

On 10/4/21 9:20 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,

Pandas version is 0.23.0.

The reason the below does not work is method did not show up until pandas
0.24.0.

I used the following code:

def psql_insert_copy(table, conn, keys, data_iter):
# gets a DBAPI connection that can provide a cursor
dbapi_conn = conn.connection
with dbapi_conn.cursor() as cur:
s_buf = StringIO()
writer = csv.writer(s_buf)
writer.writerows(data_iter)
s_buf.seek(0)

columns = ', '.join('"{}"'.format(k) for k in keys)
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name <
http://table.name&gt;)
else:
table_name = table.name <http://table.name&gt;

sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
table_name, columns)
cur.copy_expert(sql=sql, file=s_buf)
engine = create_engine('postgresql+psycopg2://:5432/postgres')
try:
df.to_sql('test1', engine, schema='public', if_exists='append',
index=False, method=psql_insert_copy)

I could not find obvious reasons.

Regards,

David

--
Adrian Klaver
adrian.klaver@aklaver.com

#8Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Shaozhong SHI (#7)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On 10/4/21 10:10 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,
What is the robust way to upgrade Pandas?

Carefully.

The most recent version is 1.3.3, which is approximately 5 versions
ahead of where you are now. The big jump is when Pandas went from 0.25
to 1.0. See docs here:

https://pandas.pydata.org/docs/whatsnew/v1.0.0.html?highlight=upgrade

So the process should be 0.24 -> 0.25, verify, 0.25 -> 1.0, verify. Then
on to wherever you want to end up a step at a time.

Before each step spend time here:

https://pandas.pydata.org/docs/whatsnew/

to see what the gotcha's are.

Regards,
David

--
Adrian Klaver
adrian.klaver@aklaver.com

#9Steve Midgley
science@misuse.org
In reply to: Shaozhong SHI (#1)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On Mon, Oct 4, 2021 at 8:55 AM Shaozhong SHI <shishaozhong@gmail.com> wrote:

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/&gt;

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

Interesting - I wasn't aware there was a faster method than "COPY ... FROM
... CSV"

Maybe my best input is that I've always found COPY/FROM/CSV to be insanely
fast and really easy to write from any standard console or ORM, so maybe
the fractional performance increase from other methods isn't worth it,
given how performant, reliable and easy to use this approach is? Not really
an answer but food for thought..

Steve

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Adrian Klaver (#8)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On 10/4/21 10:28 AM, Adrian Klaver wrote:

On 10/4/21 10:10 AM, Shaozhong SHI wrote:

Hello, Adrian Klaver,
What is the robust way to upgrade Pandas?

Carefully.

The most recent version is 1.3.3, which is approximately 5 versions
ahead of where you are now. The big jump is when Pandas went from 0.25
to 1.0. See docs here:

https://pandas.pydata.org/docs/whatsnew/v1.0.0.html?highlight=upgrade

So the process should be 0.24 -> 0.25, verify, 0.25 -> 1.0, verify. Then
on to wherever you want to end up a step at a time.

Before each step spend time here:

https://pandas.pydata.org/docs/whatsnew/

to see what the gotcha's are.

Should have added:

If you are not already working in a virtualenv it would be a good idea
to do the above in one or more.

Regards,
David

--
Adrian Klaver
adrian.klaver@aklaver.com

#11Shaozhong SHI
shishaozhong@gmail.com
In reply to: Steve Midgley (#9)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

Hello, Steve,

That is interesting.

Send me the link for Python doing "Copy---From....CSV".

I would like to review and test.

Regards,

David

On Mon, 4 Oct 2021 at 18:30, Steve Midgley <science@misuse.org> wrote:

Show quoted text

On Mon, Oct 4, 2021 at 8:55 AM Shaozhong SHI <shishaozhong@gmail.com>
wrote:

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/&gt;

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

Interesting - I wasn't aware there was a faster method than "COPY ... FROM
... CSV"

Maybe my best input is that I've always found COPY/FROM/CSV to be insanely
fast and really easy to write from any standard console or ORM, so maybe
the fractional performance increase from other methods isn't worth it,
given how performant, reliable and easy to use this approach is? Not really
an answer but food for thought..

Steve

#12Steve Midgley
science@misuse.org
In reply to: Shaozhong SHI (#11)
Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

On Mon, Oct 4, 2021 at 1:19 PM Shaozhong SHI <shishaozhong@gmail.com> wrote:

Hello, Steve,

That is interesting.

Send me the link for Python doing "Copy---From....CSV".

I would like to review and test.

Regards,

David

On Mon, 4 Oct 2021 at 18:30, Steve Midgley <science@misuse.org> wrote:

On Mon, Oct 4, 2021 at 8:55 AM Shaozhong SHI <shishaozhong@gmail.com>
wrote:

Has anyone tested this one?
A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
Valentiner
<https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/&gt;

I tried psql_insert_copy method, but I got the following error message.

to_sql() got an unexpected keyword argument 'method'

Can anyone shed light on this?

Interesting - I wasn't aware there was a faster method than "COPY ...
FROM ... CSV"

Maybe my best input is that I've always found COPY/FROM/CSV to be
insanely fast and really easy to write from any standard console or ORM, so
maybe the fractional performance increase from other methods isn't worth
it, given how performant, reliable and easy to use this approach is? Not
really an answer but food for thought..

The way I've implemented that (using Ruby, but very similar) is to

actually copy from STDIN rather than CSV. In my implementation I wasn't
running my import code on the Postgres server, so I couldn't get the CSV
file onto the local DB machine. So I had to feed the data over the wire via
STDIN. This is slower than if you can point Postgres to a CSV file that the
Pg server itself can access on the file system, but it's more flexible and
still remarkably fast. It can be even (a bit) faster if you don't feed
STDIN line by line, but in my experience it's a nightmare to debug if you
don't get a line number when an import file barfs.

Anyway, here's the Ruby code that shows the fast import concept:
https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da

I wrote this a LONG time ago, but I think the concepts are still relevant.
IIRC, feeding Pg from STDIN was ~400x faster than using "insert into" or
other SQL type commands to get the data in.

I hope this is helpful - I would think this code is easily portable to
Python, as you just need to get a raw connection to Postgres from your ORM
adapter and then model the commands the same way in the code above. Good
luck and write with questions..

Steve