How batch processing works
Hello,
I have seen multiple threads around this topic, so I want some clarification. As we know, row-by-row is "slow-by-slow" processing, so in write-heavy systems (say the client app is in Java), people are asked to do DML in batches rather than in a row-by-row fashion, to minimize the chatter, i.e. the round trips between database and client, which are resource intensive. What I understand is that true batch processing means the client has to collect all the input bind values, prepare the insert statement, submit it to the database in one shot, and then commit.
What does that actually mean, and if we divide the options as below, which method truly does batch processing, or does some other method exist for doing batch processing with Postgres as the backend database?
I understand that the first method below is truly row-by-row processing, in which a round trip between client and database happens for each row, whereas the second method just batches the commits but is not true batch processing, as it does the same number of round trips between the database and client. But regarding the third and fourth methods: will both execute in a similar fashion in the database, with the same number of round trips? Does any other, better method exist to do these inserts in batches? I appreciate your guidance.
CREATE TABLE parent_table (
id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE child_table (
id SERIAL PRIMARY KEY,
parent_id INT REFERENCES parent_table(id),
value TEXT
);
Method-1
insert into parent_table values(1,'a');
commit;
insert into parent_table values(2,'a');
commit;
insert into child_table values(1,1,'a');
commit;
insert into child_table values(2,2,'a');
commit;
VS
Method-2
insert into parent_table values(1,'a');
insert into parent_table values(2,'a');
insert into child_table values(1,1,'a');
insert into child_table values(2,2,'a');
commit;
VS
Method-3
with
a as ( insert into parent_table values(1,'a') )
, a1 as (insert into parent_table values(2,'a') )
, b as (insert into child_table values(1,1,'a') )
, b1 as (insert into child_table values(2,2,'a') )
select;
commit;
Method-4
INSERT INTO parent_table VALUES (1, 'a'), (2, 'a');
INSERT INTO child_table VALUES (1, 1, 'a'), (2, 2, 'a');
commit;
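For illustration only (the helper name below is my own invention, not from any driver API): in a client program, Method-4's multi-row statement is usually built programmatically rather than written by hand. A minimal sketch in Python, assuming psycopg2-style %s placeholders:

```python
from itertools import chain

def build_multirow_insert(table, rows, cols_per_row):
    """Fold many rows into ONE parameterized INSERT (Method-4 style).

    Returns the SQL text plus the flattened parameter list, so the whole
    batch travels to the server as a single statement, i.e. one round trip.
    NOTE: `table` must be a trusted identifier, never user input.
    """
    one_row = "(" + ",".join(["%s"] * cols_per_row) + ")"
    placeholders = ",".join([one_row] * len(rows))
    sql = f"INSERT INTO {table} VALUES {placeholders}"
    params = list(chain.from_iterable(rows))
    return sql, params

sql, params = build_multirow_insert("parent_table", [(1, 'a'), (2, 'a')], 2)
# sql    == "INSERT INTO parent_table VALUES (%s,%s),(%s,%s)"
# params == [1, 'a', 2, 'a']
```

A driver cursor would then run it as `cur.execute(sql, params)`, binding all values in one call.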
Regards
Lok
On Thu, Sep 19, 2024 at 1:31 AM Lok P <loknath.73@gmail.com> wrote:
[snip]
Method-2
insert into parent_table values(1,'a');
insert into parent_table values(2,'a');
insert into child_table values(1,1,'a');
insert into child_table values(2,2,'a');
commit;
As a former "DP" programmer, from an application point of view, this is
absolutely batch programming.
My experience was with COBOL and C, though, which were low overhead. From
what I've seen in PG log files, JDBC is astoundingly chatty.
[snip]
Method-4
INSERT INTO parent_table VALUES (1, 'a'), (2, 'a');
INSERT INTO child_table VALUES (1,1, 'a'), (1,2, 'a');
commit;
If I knew that I had to load a structured input data file (even if it had
parent and child records), this is how I'd do it (but probably first try
and see if "in-memory COPY INTO" is such a thing).
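"In-memory COPY" is in fact a thing: PostgreSQL's COPY ... FROM STDIN can be fed from an in-memory buffer, and in psycopg2 that is `cursor.copy_expert()`. A minimal sketch (the serializer below is my own illustration and is naive: it does not escape tabs, backslashes, or NULLs):

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into PostgreSQL COPY text format
    (tab-separated columns, newline-terminated rows)."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)
    return buf

buf = rows_to_copy_buffer([(1, 'a'), (2, 'a')])
# With a psycopg2 cursor, the whole buffer streams in one operation:
#   csr.copy_expert("COPY parent_table FROM STDIN", buf)
```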
--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> crustacean!
On Thu, Sep 19, 2024 at 11:31 AM Ron Johnson <ronljohnsonjr@gmail.com>
wrote:
[snip]
Method-4
INSERT INTO parent_table VALUES (1, 'a'), (2, 'a');
INSERT INTO child_table VALUES (1,1, 'a'), (1,2, 'a');
commit;
If I knew that I had to load a structured input data file (even if it had
parent and child records), this is how I'd do it (but probably first try
and see if "in-memory COPY INTO" is such a thing).
I was trying to reproduce this behaviour using row-by-row commit vs just
batch commit vs true batch insert as you mentioned, but I am not able to see
any difference in response time between "batch commit" and "true batch
insert". Am I missing anything?
CREATE TABLE debug_log (
method1 TEXT,
start_time TIMESTAMP,
end_time TIMESTAMP,
elapsed_time INTERVAL
);
CREATE TABLE parent_table (
id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE parent_table2 (
id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE parent_table3 (
id SERIAL PRIMARY KEY,
name TEXT
);
DO $$
DECLARE
num_inserts INTEGER := 100000;
batch_size INTEGER := 50;
start_time TIMESTAMP;
end_time TIMESTAMP;
elapsed_time INTERVAL;
i INTEGER;
BEGIN
-- Method 1: Individual Inserts with Commit after every Row
start_time := clock_timestamp();
FOR i IN 1..num_inserts LOOP
INSERT INTO parent_table VALUES (i, 'a');
COMMIT;
END LOOP;
end_time := clock_timestamp();
elapsed_time := end_time - start_time;
INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
VALUES ('Method 1: Individual Inserts with Commit after every Row',
start_time, end_time, elapsed_time);
-- Method 2: Individual Inserts with Commit after every 50 Rows
start_time := clock_timestamp();
FOR i IN 1..num_inserts LOOP
INSERT INTO parent_table2 VALUES (i, 'a');
-- Commit after every batch_size (50) rows
IF i % batch_size = 0 THEN
COMMIT;
END IF;
END LOOP;
-- Final commit if not already committed
commit;
end_time := clock_timestamp();
elapsed_time := end_time - start_time;
INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
VALUES ('Method 2: Individual Inserts with Commit after 50 Rows',
start_time, end_time, elapsed_time);
-- Method 3: Multi-row Batch Inserts with Commit after each batch
start_time := clock_timestamp();
FOR i IN 1..(num_inserts / batch_size) LOOP
INSERT INTO parent_table3 VALUES
(1 + (i - 1) * batch_size, 'a'),
(2 + (i - 1) * batch_size, 'a'),
(3 + (i - 1) * batch_size, 'a'),
(4 + (i - 1) * batch_size, 'a'),
(5 + (i - 1) * batch_size, 'a'),
(6 + (i - 1) * batch_size, 'a'),
(7 + (i - 1) * batch_size, 'a'),
(8 + (i - 1) * batch_size, 'a'),
(9 + (i - 1) * batch_size, 'a'),
(10 + (i - 1) * batch_size, 'a'),
(11 + (i - 1) * batch_size, 'a'),
(12 + (i - 1) * batch_size, 'a'),
(13 + (i - 1) * batch_size, 'a'),
(14 + (i - 1) * batch_size, 'a'),
(15 + (i - 1) * batch_size, 'a'),
(16 + (i - 1) * batch_size, 'a'),
(17 + (i - 1) * batch_size, 'a'),
(18 + (i - 1) * batch_size, 'a'),
(19 + (i - 1) * batch_size, 'a'),
(20 + (i - 1) * batch_size, 'a'),
(21 + (i - 1) * batch_size, 'a'),
(22 + (i - 1) * batch_size, 'a'),
(23 + (i - 1) * batch_size, 'a'),
(24 + (i - 1) * batch_size, 'a'),
(25 + (i - 1) * batch_size, 'a'),
(26 + (i - 1) * batch_size, 'a'),
(27 + (i - 1) * batch_size, 'a'),
(28 + (i - 1) * batch_size, 'a'),
(29 + (i - 1) * batch_size, 'a'),
(30 + (i - 1) * batch_size, 'a'),
(31 + (i - 1) * batch_size, 'a'),
(32 + (i - 1) * batch_size, 'a'),
(33 + (i - 1) * batch_size, 'a'),
(34 + (i - 1) * batch_size, 'a'),
(35 + (i - 1) * batch_size, 'a'),
(36 + (i - 1) * batch_size, 'a'),
(37 + (i - 1) * batch_size, 'a'),
(38 + (i - 1) * batch_size, 'a'),
(39 + (i - 1) * batch_size, 'a'),
(40 + (i - 1) * batch_size, 'a'),
(41 + (i - 1) * batch_size, 'a'),
(42 + (i - 1) * batch_size, 'a'),
(43 + (i - 1) * batch_size, 'a'),
(44 + (i - 1) * batch_size, 'a'),
(45 + (i - 1) * batch_size, 'a'),
(46 + (i - 1) * batch_size, 'a'),
(47 + (i - 1) * batch_size, 'a'),
(48 + (i - 1) * batch_size, 'a'),
(49 + (i - 1) * batch_size, 'a'),
(50 + (i - 1) * batch_size, 'a'));
COMMIT;
END LOOP;
COMMIT; -- Final commit for all
end_time := clock_timestamp();
elapsed_time := end_time - start_time;
INSERT INTO debug_log (method1, start_time, end_time, elapsed_time)
VALUES ('Method 3: Batch Inserts with Commit after each batch', start_time,
end_time, elapsed_time);
END $$;
On Thu, Sep 19, 2024 at 5:24 AM Lok P <loknath.73@gmail.com> wrote:
On Thu, Sep 19, 2024 at 11:31 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
[snip]
I was trying to reproduce this behaviour using row-by-row commit vs just
batch commit vs true batch insert as you mentioned, but I am not able to see
any difference in response time between "batch commit" and "true batch
insert". Am I missing anything?
[snip]
Reproduce what behavior?
Anyway, plpgsql functions (including anonymous DO statements) are -- to
Postgresql -- single statements. Thus, they'll be faster than
individual calls.
An untrusted language like plpython3u might speed things up even more, if
you have to read a heterogeneous external file and insert all the records
into the db.
On Thu, Sep 19, 2024 at 5:40 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
On Thu, Sep 19, 2024 at 5:24 AM Lok P <loknath.73@gmail.com> wrote:
[snip]
Reproduce what behavior?
Anyway, plpgsql functions (including anonymous DO statements) are -- to
Postgresql -- single statements. Thus, they'll be faster than
individual calls.
An untrusted language like plpython3u might speed things up even more, if
you have to read a heterogeneous external file and insert all the records
into the db.
Here, if you look at my script, method-1 does a commit after each row
insert, method-2 does a batch commit, i.e. a commit after every 50 rows,
and method-3 does a true batch insert, i.e. it combines all 50 values into
one insert statement, submits it to the database in one shot, and then
commits, so the round trips should be far fewer. So I was expecting
method-3 to be the fastest way to insert the rows here, but the results
show the same response time for method-2 and method-3. Method-1 is the
slowest, though.
Below are the results for the posted methods. I tested it locally and saw
no difference in timing between method-2 and method-3. It failed to run in
dbfiddle somehow.
Also, I was initially worried that adding a trigger to our target table
would worsen performance, because it would turn all the execution into
"row by row" rather than a true batch insert (method-3 as posted). But it
seems it would still be doing batch commits (the way method-2 does), so as
per that, we won't lose any performance as such. Is this understanding
correct?
*Method-1- 00:01:44.48*
*Method-2- 00:00:02.67*
*Method-3- 00:00:02.39*
https://gist.github.com/databasetech0073/8e9106757d751358c0c0c65a2374dbc6
On Thu, Sep 19, 2024 at 6:42 PM Lok P <loknath.73@gmail.com> wrote:
[snip]
Hi,
On 19 Sep 2024, at 07:30, Lok P <loknath.73@gmail.com> wrote:
[snip]
Method-4
INSERT INTO parent_table VALUES (1, 'a'), (2, 'a');
INSERT INTO child_table VALUES (1,1, 'a'), (1,2, 'a');
commit;
I’ve done some batch processing of JSON messages from Kafka in Java.
By far the most performant way was to:
1. Use prepared statements
2. Parse JSON messages in Postgres
3. Process messages in batches
All three can be achieved by using arrays to pass batches:
WITH parsed AS (
SELECT msg::json AS msg FROM unnest(?) AS t(msg)
),
parents AS (
INSERT INTO parent SELECT … FROM parsed RETURNING ...
)
INSERT INTO child SELECT … FROM parsed…
Note the single parameter that you can bind to a String[].
Hope that helps.
--
Michal
On Sat, Sep 21, 2024 at 9:51 AM Michał Kłeczek <michal@kleczek.org> wrote:
[snip]
Got your point.
But I am wondering why we don't see any difference in performance between
method-2 and method-3 above. Does it mean that I am testing this in a
wrong way, or is this the expected behaviour, and there is thus no point
in converting the row-by-row inserts into a bulk insert, since just
changing the commit frequency will do the same job in a row-by-row insert
approach?
On 2024-09-21 16:44:08 +0530, Lok P wrote:
But wondering why we don't see any difference in performance between method-2
and method-3 above.
The code runs completely inside the database. So there isn't much
difference between a single statement which inserts 50 rows and 50
statements which insert 1 row each. The work to be done is (almost) the
same.
This changes once you consider an application which runs outside of the
database (maybe even on a different host). Such an application has to
wait for the result of each statement before it can send the next one.
Now it makes a difference whether you are waiting 50 times for a
statement which does very little or just once for a statement which does
more work.
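To put rough numbers on that, here is a back-of-envelope cost model (my own illustration, not measured data): each statement costs one network round trip plus some small, assumed per-row server work.

```python
def total_time_ms(n_rows, rows_per_stmt, rtt_ms, per_row_ms=0.01):
    """Crude cost model: one round trip per statement, plus per-row work.

    rtt_ms is the client<->server round-trip time; per_row_ms is an
    assumed, purely illustrative per-row insert cost on the server.
    """
    statements = -(-n_rows // rows_per_stmt)  # ceiling division
    return statements * rtt_ms + n_rows * per_row_ms

# 10,000 rows over a 20 ms link:
slow = total_time_ms(10_000, 1, 20)   # 10,000 round trips
fast = total_time_ms(10_000, 50, 20)  # 200 round trips
# slow is roughly 200,000 ms of waiting; fast roughly 4,000 ms --
# almost all of the difference is network waiting, not insert work.
```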
So does it mean that,I am testing this in a wrong way or
That depends on what you want to test. If you are interested in the
behaviour of stored procedures, the test is correct. If you want to know
about the performance of a database client (whether it's written in Java,
Python, Go or whatever), this is the wrong test. You have to write the
test in your target language and run it on the client system to get
realistic results (for example, the round-trip times will be a lot
shorter if the client and database are on the same computer than when
one is in Europe and the other in America).
For example, here are the three methods as Python scripts:
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3
import time
import psycopg2

num_inserts = 10_000

db = psycopg2.connect()
csr = db.cursor()
csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")

start_time = time.monotonic()
for i in range(1, num_inserts+1):
    csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
    db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 1: Individual Inserts with Commit after every Row: {elapsed_time:.3} seconds")
# vim: tw=99
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3
import time
import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()
csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
for i in range(1, num_inserts+1):
    csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
    if i % batch_size == 0:
        db.commit()
db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 2: Individual Inserts with Commit after {batch_size} Rows: {elapsed_time:.3} seconds")
# vim: tw=99
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3
import itertools
import time
import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()
csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
batch = []
for i in range(1, num_inserts+1):
    batch.append((i, 'a'))
    if i % batch_size == 0:
        q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
        params = list(itertools.chain.from_iterable(batch))
        csr.execute(q, params)
        db.commit()
        batch = []
if batch:
    q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
    csr.execute(q, list(itertools.chain.from_iterable(batch)))
    db.commit()
    batch = []
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each batch: {elapsed_time:.3} seconds")
# vim: tw=99
---------------------------------------------------------------------------------------------------
On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times faster than method3, simply because the delay in
communication is now large compared to the time it takes to insert those
rows.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
On Sat, Sep 21, 2024 at 8:07 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
[snip]
On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times faster than method3, simply because the delay in
communication is now large compared to the time it takes to insert those
rows.
Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least, as
you said, to perform at a similar speed to method-2 (row-by-row insert
with batch commit) when run within a procedure inside the database.
Because the round trips are minimal in method-3, as it prepares the insert
and submits it to the database in one shot in one DB call, it should be a
bit faster. But from your figures it appears to be the opposite, i.e.
method-2 is faster than method-3, and I am not able to understand the
reason. So in this case it appears we can follow method-2, as that is
cheaper with regard to code change, i.e. just shifting the commit points
without any changes for doing the batch insert.
Btw, do you have any thoughts on why method-2 is faster than method-3 in
your test?
On 9/21/24 07:36, Peter J. Holzer wrote:
[snip]
FYI, this is less of a problem with psycopg(3) and pipeline mode:
import time
import psycopg

num_inserts = 10_000
batch_size = 50

db = psycopg.connect("dbname=test user=postgres host=104.237.158.68")
csr = db.cursor()
csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
with db.pipeline():
    for i in range(1, num_inserts+1):
        csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
        if i % batch_size == 0:
            db.commit()
    db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 2: Individual Inserts (psycopg3 pipeline mode) with Commit after {batch_size} Rows: {elapsed_time:.3} seconds")
For a remote connection to a database in another state, that took the time from:
Method 2: Individual Inserts with Commit after 50 Rows: 2.42e+02 seconds
to:
Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit after
50 Rows: 9.83 seconds
[snip]
The above method-3 batch build can also be handled with execute_batch()
and execute_values() from:
https://www.psycopg.org/docs/extras.html#fast-execution-helpers
On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times faster than method3 . Simply because the delay in
communication is now large compared to the time it takes to insert those
rows.
hp
--
Adrian Klaver
adrian.klaver@aklaver.com
On Sun, Sep 22, 2024 at 12:46 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:
On 9/21/24 07:36, Peter J. Holzer wrote:
On 2024-09-21 16:44:08 +0530, Lok P wrote:
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import time

import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()
csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
for i in range(1, num_inserts + 1):
    csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
    if i % batch_size == 0:
        db.commit()
db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 2: Individual Inserts with Commit after {batch_size} Rows: {elapsed_time:.3} seconds")
# vim: tw=99
---------------------------------------------------------------------------------------------------
[... quoted code elided, same as above ...]
On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times faster than method3 . Simply because the delay in
communication is now large compared to the time it takes to insert those
rows.
Thank you. So if I understand correctly: if the client app (from which the
data is getting streamed/inserted) is in the same data center/zone as the
database (which is most of the time the case), then the batch insert does
not appear to be very beneficial.
Which also means people here were afraid of having triggers on such a
high-DML table, because a trigger turns the "batch insert" into "row by
row" processing behind the scenes. But considering the above results, it
looks fine to go with a row-by-row approach (just with a batch commit in
place of a row-by-row commit), and not to worry about implementing the
true batch insert approach, as it is not making a big difference in data
load performance here.
On 9/21/24 22:21, Lok P wrote:
On Sun, Sep 22, 2024 at 12:46 AM Adrian Klaver
<adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>> wrote:
Thank you. So if I understand correctly: if the client app (from which the
data is getting streamed/inserted) is in the same data center/zone as the
database (which is most of the time the case), then the batch insert does
not appear to be very beneficial.
No, the point is that once the client and the database are not on the
same machine, the network they communicate across becomes a
consideration. In a contrived example, the client could be in the same
data center as the database server but talking to it over a dial-up
modem, and the data transfer would be worse than the same client talking
to a database server 1000 miles away across a fiber optic connection.
This gets back to: plan --> test.
Which also means people here were afraid of having triggers on such a
high-DML table, because a trigger turns the "batch insert" into "row by
row" processing behind the scenes. But considering the above results, it
looks fine to go with a row-by-row approach (just with a batch commit in
place of a row-by-row commit), and not to worry about implementing the
true batch insert approach, as it is not making a big difference in data
load performance here.
This is getting ahead of the game. The immediate issue is the deadlock
issues with the concurrent sessions and duplicate data.
--
Adrian Klaver
adrian.klaver@aklaver.com
On 2024-09-21 20:55:13 +0530, Lok P wrote:
On Sat, Sep 21, 2024 at 8:07 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
[... lots of code elided. method2 used 1 insert per row, method3 1
insert for 50 rows ...]
On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times faster than method3 . Simply because the delay in
communication is now large compared to the time it takes to insert those
rows.
Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least, as
you said, to perform with similar speed as method-2 (row by row insert
with batch commit).
Oops, sorry! I wrote that the wrong way around. Method 3 is the fastest.
I guess I meant to write "method2 takes about twice as long as method3"
or something like that.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
On 2024-09-21 12:15:44 -0700, Adrian Klaver wrote:
FYI, this is less of a problem with psycopg(3) and pipeline mode:
[...]
with db.pipeline():
    for i in range(1, num_inserts+1):
        csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
        if i % batch_size == 0:
            db.commit()
db.commit()
[...]
For remote to a database in another state that took the time from:
Method 2: Individual Inserts with Commit after 50 Rows: 2.42e+02 seconds
to:
Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit after 50
Rows: 9.83 seconds
Very cool. I'll keep that in mind.
I've been using psycopg 3 for newer projects, but for throwaway code
I've been sticking to psycopg2, simply because it's available from the
repos of all my usual distributions. Psycopg 3 is now in both Debian and
Ubuntu, so that will change.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
On Mon, Sep 23, 2024 at 12:53 AM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least, as you said, to
perform with similar speed as method-2 (row by row insert with batch
commit)
Oops, sorry! I wrote that the wrong way around. Method 3 is the fastest.
I guess I meant to write "method2 takes about twice as long as method3"
or something like that.
As in the case of batch insert, the statement below is the fastest one, as
it inserts multiple rows in one statement. Similarly, I understand Delete
can be batched as below. However, can you suggest how an Update can be
batched in a simple/generic fashion in JDBC for an input data stream with
multiple input values? Because if we write the update as below, it will
just do one row update at a time:
Update <table_name> SET column1=?, column2=?, column3=? where <PK_Column>=?;
INSERT INTO <table_name> VALUES (1, 'a'), (2, 'a'), (3, 'a');
Delete from <table_name> where column_name in (<value1>, <value2>, <value3>...);
On 10/4/24 1:05 PM, Lok P wrote:
On Mon, Sep 23, 2024 at 12:53 AM Peter J. Holzer <hjp-pgsql@hjp.at
<mailto:hjp-pgsql@hjp.at>> wrote:
Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least,
as you said, to perform with similar speed as method-2 (row by row
insert with batch commit)
Oops, sorry! I wrote that the wrong way around. Method 3 is the fastest.
I guess I meant to write "method2 takes about twice as long as method3"
or something like that.
As in the case of batch insert, the statement below is the fastest one,
as it inserts multiple rows in one statement. Similarly, I understand
Delete can be batched as below. However, can you suggest how an Update
can be batched in a simple/generic fashion in JDBC for an input data
stream with multiple input values? Because if we write the update as
below, it will just do one row update at a time:
Update <table_name> SET column1=?, column2=?, column3=? where <PK_Column>=?;
UPDATE table_name SET column1 = vals.text_val, column2 = vals.int_val
FROM (VALUES (1, 'dog', 23), (2, 'cat', 44)) AS vals (id, text_val, int_val)
WHERE table_name.id = vals.id;
INSERT INTO <table_name> VALUES (1, 'a'), (2, 'a'),(3,'a');
Delete from <table_name> where column_name in (<value1>,
<value2>,<value3>...);
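A sketch of how a client could build that UPDATE ... FROM (VALUES ...) statement generically with psycopg2-style placeholders (the table and column names follow the example above; the helper name and the rows are made up for illustration):

```python
# Build one UPDATE ... FROM (VALUES ...) statement for a whole batch of
# rows, so all updates travel to the server in a single round trip.
def build_batch_update(rows):
    # rows: list of (id, text_val, int_val) tuples
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    query = (
        "UPDATE table_name SET column1 = vals.text_val, column2 = vals.int_val "
        "FROM (VALUES " + placeholders + ") AS vals (id, text_val, int_val) "
        "WHERE table_name.id = vals.id"
    )
    params = [p for row in rows for p in row]  # flatten for execute()
    return query, params

query, params = build_batch_update([(1, 'dog', 23), (2, 'cat', 44)])
# csr.execute(query, params) would then update both rows in one statement
print(params)  # [1, 'dog', 23, 2, 'cat', 44]
```

One caveat: with server-side parameter binding (e.g. psycopg 3 or JDBC), the VALUES columns may need explicit casts such as `%s::int` so PostgreSQL can infer their types; psycopg2 interpolates parameters client-side, so this is usually not an issue there.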
--
Adrian Klaver
adrian.klaver@aklaver.com