Batch update million records in prd DB

Started by Yi Sun · about 5 years ago · 9 messages · general
#1 Yi Sun
yinan81@gmail.com

Hello,

We need to update several million records in a table in our production DB. If
we batch-update 1000 records at a time and commit after each batch, as in the
sample script below, will it affect the production application?

Sample script:

DO $MAIN$
DECLARE
    affect_count integer := 1;  -- initialized so the first EXIT test is well-defined
    chunk_size CONSTANT integer := 1000;
    sleep_sec CONSTANT numeric := 0.1;
BEGIN

    LOOP

        EXIT WHEN affect_count = 0;

        UPDATE tbl a
        SET name = ''
        WHERE a.id IN (SELECT id
                       FROM tbl b
                       WHERE name IS NULL
                       LIMIT chunk_size);

        GET DIAGNOSTICS affect_count = ROW_COUNT;

        COMMIT;

        PERFORM pg_sleep(sleep_sec);

    END LOOP;
END;
$MAIN$;

Thanks and best regards

#2 Michael Lewis
mlewis@entrata.com
In reply to: Yi Sun (#1)
Re: Batch update million records in prd DB

Of course it will impact a system using that table, but I expect it won't be
significant, and the production system should handle it. If you are committing
like this, then you can kill the script at any time and not lose any work.
The query to find the next IDs to update is probably the slowest part of
this, depending on what indexes you have.
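If the candidate-ID query is indeed the bottleneck, a partial index is one way to make it cheap, since only the rows still needing the update are indexed, and rows drop out of it as they are updated. A sketch only, using the hypothetical `tbl`/`name` names from the sample script above:

```sql
-- Sketch: index only the not-yet-updated rows; CONCURRENTLY avoids
-- blocking writes on a production table while the index is built.
CREATE INDEX CONCURRENTLY IF NOT EXISTS tbl_name_null_idx
    ON tbl (id)
    WHERE name IS NULL;
```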

#3 Yi Sun
yinan81@gmail.com
In reply to: Michael Lewis (#2)
Re: Batch update million records in prd DB

Hi Michael,

Thank you for your reply

We found that each loop takes a different amount of time, and it becomes
slower and slower. Our table is big and joins another table; even using an
index, the last 1000 records take around 15 seconds. Will it be a problem?
Will other concurrent updates have to wait 15 seconds until the lock is
released?

Thanks and best regards


#4 Michael Lewis
mlewis@entrata.com
In reply to: Yi Sun (#3)
Re: Batch update million records in prd DB

It might be a concern, but generally that should be a row-level lock and
only block other update/delete operations on those rows. It might be helpful
to look at the explain analyze output early on vs later in the process. It
might be that you are getting very few HOT updates and the indexes are being
updated constantly.
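Whether HOT updates are happening can be read from the cumulative statistics views. A sketch, using the `app` table name that appears later in this thread:

```sql
-- If n_tup_hot_upd is small relative to n_tup_upd, most updates are
-- non-HOT and every affected index is being maintained on each update.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'app';
```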

#5 Yi Sun
yinan81@gmail.com
In reply to: Michael Lewis (#4)
Re: Batch update million records in prd DB

Hi Michael

This is the script and the explain plan info, please check. It seems that
the more records the Filter removes, the more time it takes.

DO $MAIN$
DECLARE
    affect_count integer := 1000;
    processed_row_count integer := 0;
BEGIN
    LOOP
        EXIT WHEN affect_count = 0;

        UPDATE app gaa
        SET deleted_at = (
                SELECT CAST(extract(epoch FROM now() at time zone 'utc') *
                            1000000000 AS bigint))
        WHERE gaa.id IN (
                SELECT gab.id
                FROM app gab
                LEFT JOIN pol gp ON gab.policy_id = gp.id
                WHERE gab.policy_type = 'policy.protection.total'
                  AND gp.name LIKE 'Mobile backup%'
                  AND gab.deleted_at IS NULL
                  AND gp.deleted_at IS NOT NULL
                LIMIT 1000);

        GET DIAGNOSTICS affect_count = ROW_COUNT;
        COMMIT;
        processed_row_count := processed_row_count + affect_count;
    END LOOP;
    RAISE NOTICE 'total processed rows %', processed_row_count;
END;
$MAIN$;

--early explain plan, 1000 records update took 156.488 ms
--------------------------------------------------------------------
Update on app gaa  (cost=3307.57..6085.41 rows=1000 width=3943) (actual time=156.347..156.347 rows=0 loops=1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
  ->  Nested Loop  (cost=3307.54..6085.39 rows=1000 width=3943) (actual time=18.599..33.987 rows=1000 loops=1)
        ->  HashAggregate  (cost=3306.99..3316.99 rows=1000 width=98) (actual time=18.554..19.085 rows=1000 loops=1)
              Group Key: ("ANY_subquery".id)::text
              ->  Subquery Scan on "ANY_subquery"  (cost=2.17..3304.49 rows=1000 width=98) (actual time=0.041..18.052 rows=1000 loops=1)
                    ->  Limit  (cost=2.17..3294.49 rows=1000 width=37) (actual time=0.030..17.827 rows=1000 loops=1)
                          ->  Merge Join  (cost=2.17..877396.03 rows=266497 width=37) (actual time=0.029..17.764 rows=1000 loops=1)
                                Merge Cond: ((gab.policy_id)::text = (gp.id)::text)
                                ->  Index Scan using tmp_uq_policy_id_context2 on app gab  (cost=0.56..487631.06 rows=3151167 width=74) (actual time=0.018..9.192 rows=3542 loops=1)
                                      Filter: ((policy_type)::text = 'policy.protection.total'::text)
                                      Rows Removed by Filter: 2064
                                ->  Index Scan using pol_pkey on pol gp  (cost=0.56..378322.78 rows=361105 width=37) (actual time=0.008..7.380 rows=1006 loops=1)
                                      Filter: ((deleted_at IS NOT NULL) AND (name ~~ 'Mobile backup%'::text))
                                      Rows Removed by Filter: 3502
        ->  Index Scan using app2_pkey on app gaa  (cost=0.56..2.77 rows=1 width=3874) (actual time=0.014..0.014 rows=1 loops=1000)
              Index Cond: ((id)::text = ("ANY_subquery".id)::text)
Planning Time: 0.852 ms
Execution Time: 156.488 ms

--later explain plan, 1000 records update took 13301.600 ms
--------------------------------------------------------------------------
Update on app gaa  (cost=3789.35..6567.19 rows=1000 width=3980) (actual time=13301.466..13301.466 rows=0 loops=1)
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.02 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=1)
  ->  Nested Loop  (cost=3789.32..6567.17 rows=1000 width=3980) (actual time=12881.004..12896.440 rows=1000 loops=1)
        ->  HashAggregate  (cost=3788.77..3798.77 rows=1000 width=98) (actual time=12880.958..12881.378 rows=1000 loops=1)
              Group Key: ("ANY_subquery".id)::text
              ->  Subquery Scan on "ANY_subquery"  (cost=2.17..3786.27 rows=1000 width=98) (actual time=12850.663..12880.505 rows=1000 loops=1)
                    ->  Limit  (cost=2.17..3776.27 rows=1000 width=37) (actual time=12850.656..12880.233 rows=1000 loops=1)
                          ->  Merge Join  (cost=2.17..862421.74 rows=228510 width=37) (actual time=12850.655..12880.162 rows=1000 loops=1)
                                Merge Cond: ((gab.policy_id)::text = (gp.id)::text)
                                ->  Index Scan using tmp_uq_policy_id_context2 on app gab  (cost=0.56..474159.31 rows=2701994 width=74) (actual time=0.017..6054.269 rows=2302988 loops=1)
                                      Filter: ((policy_type)::text = 'policy.protection.total'::text)
                                      Rows Removed by Filter: 1822946
                                ->  Index Scan using pol_pkey on pol gp  (cost=0.56..378322.78 rows=361105 width=37) (actual time=0.007..5976.346 rows=936686 loops=1)
                                      Filter: ((deleted_at IS NOT NULL) AND (name ~~ 'Mobile backup%'::text))
                                      Rows Removed by Filter: 3152553
        ->  Index Scan using app2_pkey on app gaa  (cost=0.56..2.77 rows=1 width=3911) (actual time=0.014..0.014 rows=1 loops=1000)
              Index Cond: ((id)::text = ("ANY_subquery".id)::text)
Planning Time: 0.785 ms
Execution Time: 13301.600 ms

--we also tested a temporary table solution, script as below
DO $MAIN$
DECLARE
    affect_count integer := 1;  -- initialized so the first EXIT test is well-defined
    offset_count integer := 0;
    chunk_size CONSTANT integer := 1000;
    sleep_sec CONSTANT numeric := 0.1;
BEGIN

    DROP TABLE IF EXISTS tmp_usage_tbl;
    CREATE TEMPORARY TABLE tmp_usage_tbl(id character varying(36));

    INSERT INTO tmp_usage_tbl(id)
    SELECT gab.id
    FROM app gab
    LEFT JOIN pol gp ON gab.policy_id = gp.id
    WHERE gab.policy_type = 'policy.protection.total'
      AND gp.name LIKE 'Mobile backup%'
      AND gab.deleted_at IS NULL
      AND gp.deleted_at IS NOT NULL;

    LOOP

        EXIT WHEN affect_count = 0;

        UPDATE app gaa
        SET deleted_at = (
                SELECT CAST(extract(epoch FROM now() at time zone 'utc') *
                            1000000000 AS bigint))
        WHERE gaa.id IN (SELECT id
                         FROM tmp_usage_tbl
                         ORDER BY id
                         LIMIT chunk_size OFFSET offset_count);

        GET DIAGNOSTICS affect_count = ROW_COUNT;

        COMMIT;

        offset_count := offset_count + chunk_size;

        PERFORM pg_sleep(sleep_sec);

    END LOOP;

END;
$MAIN$;

--1000 records update takes around 2000 ms (roughly the same for every batch
--when using the temporary table)

Which solution is better, please?
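One detail of the temp-table script worth noting: `LIMIT ... OFFSET offset_count` re-reads and re-sorts all the already-processed rows on every iteration, so batches slow down as the offset grows. A sketch of an alternative (same assumed table and column names as above) that consumes the temp table instead of offsetting into it:

```sql
-- Sketch: delete each processed chunk from the temp table in the same
-- transaction as the UPDATE, so the next batch always reads from the
-- front of the temp table and no growing OFFSET is needed.
WITH chunk AS (
    DELETE FROM tmp_usage_tbl
    WHERE id IN (SELECT id FROM tmp_usage_tbl LIMIT 1000)
    RETURNING id
)
UPDATE app gaa
SET deleted_at = CAST(extract(epoch FROM now() at time zone 'utc') *
                      1000000000 AS bigint)
WHERE gaa.id IN (SELECT id FROM chunk);
```

Since the delete and the update commit together, killing the loop mid-run loses no work and skips no rows.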


#6 Michael Lewis
mlewis@entrata.com
In reply to: Yi Sun (#5)
Re: Batch update million records in prd DB

1) Don't pretend it is a left join when your where clause will turn it into
an INNER join:

LEFT JOIN pol gp ON gab.policy_id = gp.id
WHERE
    AND gp.name LIKE 'Mobile backup%'
    AND gp.deleted_at IS NOT NULL;

2) It is interesting to me that the row estimates are stable, but the
number of rows filtered out and found by those two index scans changes so
dramatically. Is your underlying data changing significantly during this
run? Maybe I am not seeing something that should be obvious.

3) What is the execution plan for the update based on the temp table? It is
hard to believe it takes 2 seconds to update 1000 rows. By the way, that
temp table needs to be analyzed after it is created & populated with data,
or the planner won't know how many rows it contains or any other stats
about it. One advantage of the temp table should be that you have already
found all the candidate rows, so the time that locks are held to update
the 1000 target rows is smaller. Given you are doing an ORDER BY & LIMIT on
the temp table, I might actually create an index on the id column to help
the later runs. The temp table should likely remain in memory
(temp_buffers), but still, btree is nice for ordered use.
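Those two suggestions amount to something like the sketch below, run right after populating `tmp_usage_tbl` (the table name comes from the script earlier in the thread):

```sql
-- Give the planner row counts and stats for the freshly filled temp table,
-- then index the id column to support the ORDER BY ... LIMIT/OFFSET batching.
ANALYZE tmp_usage_tbl;
CREATE INDEX ON tmp_usage_tbl (id);
```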

#7 Kristjan Mustkivi
sonicmonkey@gmail.com
In reply to: Yi Sun (#1)
Re: Batch update million records in prd DB

Hi Yi,

I found that in postgres the memory is slowly eaten away when doing
updates within a plpgsql loop. It only gets released once the whole block
completes. While that is ok for small tables, you will eventually run out
of memory for really big ones. The working approach was to do the loop
in e.g. a python script that issued the DML statements and also called
commit. Several million rows is fortunately a relatively small number to
update, but once you get to billions this approach would likely not
work. Note that after each batch you should also call VACUUM before
starting a new one to avoid significant table bloat.
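The client-side loop described above might look like the sketch below. It is an illustration, not Kristjan's code: the connection is assumed to come from a DB-API driver such as psycopg2, and `update_sql` is whatever chunked UPDATE statement you already use (e.g. the LIMIT-1000 subquery form from earlier in this thread).

```python
import time

def run_batched_update(conn, update_sql, sleep_sec=0.1):
    """Run update_sql (which must touch at most one chunk of rows per
    execution) repeatedly, committing after each batch, until a batch
    affects zero rows. Returns the total number of rows updated."""
    total = 0
    while True:
        with conn.cursor() as cur:
            cur.execute(update_sql)
            affected = cur.rowcount   # rows touched by this batch
        conn.commit()  # short transactions: locks released per batch,
                       # and no long-lived server-side PL/pgSQL state
        total += affected
        if affected == 0:
            return total
        time.sleep(sleep_sec)  # throttle to reduce impact on production
```

With psycopg2 you would open the connection via `psycopg2.connect(...)`; the periodic `VACUUM` Kristjan mentions has to run on a separate connection in autocommit mode, since VACUUM cannot execute inside a transaction block.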

BR,

Kristjan

On Wed, Feb 24, 2021 at 3:01 PM Yi Sun <yinan81@gmail.com> wrote:

Hello,

Now need to update several million records in a table in prd DB, if can use batch update 1000 records and commit each time, if it will affect prd application like below sample script please?

Sample script:

DO $MAIN$
DECLARE
affect_count integer;
chunk_size CONSTANT integer :=1000;
sleep_sec CONSTANT numeric :=0.1;
BEGIN

loop

exit when affect_count=0;

UPDATE tbl a
SET name = ''
WHERE a.id IN (SELECT id
FROM tbl b
WHERE name IS NULL
LIMIT chunk_size);

GET DIAGNOSTICS affect_count = ROW_COUNT;

commit;

PERFORM pg_sleep(sleep_sec);

end loop;
END;
$MAIN$;

Thanks and best regards

--
Kristjan Mustkivi

Email: kristjan.mustkivi@gmail.com

#8 Yi Sun
yinan81@gmail.com
In reply to: Michael Lewis (#6)
Re: Batch update million records in prd DB

Hi Michael,

Thank you. After creating an index on the temp table column, the time cost
became smaller.


#9 Yi Sun
yinan81@gmail.com
In reply to: Kristjan Mustkivi (#7)
Re: Batch update million records in prd DB

Hi Kristjan,

Thank you for this information.

Regarding "the memory is slowly eaten away when doing updates within a
plpgsql loop": I want to check whether this memory issue exists in our
current postgresql version. To convince the developers to move the loop into
python I also need to show them proof. How can I reproduce the issue and
watch the memory slowly being eaten away? Is there an OS-level command or a
psql command to verify? Thanks
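One way to observe this, sketched under the assumption of a Linux host with shell access: find the session's backend process ID from inside psql, then sample that process's resident memory from the OS while the DO block runs. The PID below is a placeholder.

```shell
# In the psql session that will run the DO block, first run:
#   SELECT pg_backend_pid();
# Then, on the database host, sample that backend's resident set size
# (RSS, in kB) every few seconds; steady growth across batches is the
# symptom being described.
BACKEND_PID=12345   # placeholder: use the pid reported by pg_backend_pid()
while kill -0 "$BACKEND_PID" 2>/dev/null; do
    ps -o rss= -p "$BACKEND_PID"
    sleep 5
done
```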
