Faster way to run a bulk update
Hello,
We have to update a column value (from numbers like '123' to codes like
'abc', by looking them up in a reference table) in a partitioned table with
billions of rows, where each partition has hundreds of millions of rows. In
our test, updating ~30 million rows took ~20 minutes, so at that rate the
full update would take days. So my questions are:
1) Is there any built-in way to run the update query in parallel (e.g.
using parallel hints) to make it run faster?
2) Should we run each partition's update in a separate session (e.g. five
partitions updated at the same time from 5 different sessions)? And will
that have any locking effect, or can we just start the sessions and let
them run without impacting our live transactions?
UPDATE tab_part1
SET column1 = reftab.code
FROM reference_tab reftab
WHERE tab_part1.column1 = reftab.column1;
Regards
Yudhi
Hi,
On Fri, Sep 13, 2024 at 10:22 PM yudhi s <learnerdatabase99@gmail.com> wrote:
Do you have any indexes?
If not - you should, if yes - what are they?
Thank you.
Do you have any indexes?
If not - you should, if yes - what are they?
Yes, we have a primary key on this table, which is on a UUID type column,
and we also have indexes on other timestamp columns. But how is this going
to help, as we are going to update almost all the rows in the table?
On Sat, 2024-09-14 at 08:43 +0530, yudhi s wrote:
We have to update a column value (from numbers like '123' to codes like
'abc', by looking them up in a reference table) in a partitioned table with
billions of rows, where each partition has hundreds of millions of rows. In
our test, updating ~30 million rows took ~20 minutes, so at that rate the
full update would take days. So my questions are:
1) Is there any built-in way to run the update query in parallel (e.g.
using parallel hints) to make it run faster?
2) Should we run each partition's update in a separate session (e.g. five
partitions updated at the same time from 5 different sessions)? And will
that have any locking effect, or can we just start the sessions and let
them run without impacting our live transactions?
Option 1 doesn't exist.
Option 2 is possible, and you can even have more than one session working
on a single partition.
However, the strain on your system's resources and particularly the row
locks will impair normal database work.
Essentially, you can either take an extended down time or perform the updates
in very small chunks with a very low "lock_timeout" over a very long period
of time. If any of the batches fails because of locking conflicts, it has
to be retried.
Investigate with EXPLAIN (ANALYZE) why the updates take that long. It could
be a lame disk, tons of (unnecessary?) indexes or triggers, but it might as
well be the join with the lookup table, so perhaps there is room for
improvement (more "work_mem" for a hash join?).
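As an illustrative sketch of that last suggestion (the value is a
placeholder, not a recommendation), the session running the update could
raise "work_mem" first so the join with the reference table can hash in
memory:

```sql
-- Illustrative only: per-session setting, reverts when the session ends.
SET work_mem = '256MB';

UPDATE tab_part1
SET column1 = reftab.code
FROM reference_tab reftab
WHERE tab_part1.column1 = reftab.column1;
```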
Yours,
Laurenz Albe
On Sat, 14 Sept, 2024, 1:09 pm Laurenz Albe, <laurenz.albe@cybertec.at>
wrote:
Thank you so much Laurenz.
We mostly have inserts/updates happening on the current-day/live partition.
So, considering that, if we run batch updates (with a batch size of 1000)
from five different sessions in parallel on different historical partitions,
at any time they will lock 5000 rows and then commit. And those rows will
not collide with each other. So do you think that approach can cause any
locking issues? We will ensure the update of the live partition occurs when
we have the least activity, so we will not need extended down time. Please
correct me if I am wrong.
We have never used lock_timeout though; in the above case, do we need
lock_timeout?
Regarding a batch update with a batch size of 1000, is there any method in
Postgres (like the FORALL statement in Oracle) that does batch DML? Can you
please guide me on how we can do it in Postgres.
And yes, we will need to see what happens in the update using EXPLAIN
(ANALYZE). I was trying to see if we can run EXPLAIN (ANALYZE) without
doing the actual update, but it seems that is not possible.
On 2024-09-14 16:10:15 +0530, yudhi s wrote:
Thank you so much Laurenz.
We have mostly insert/update happen on current day/live partition. So
considering that, if we will run batch updates(with batch size of 1000) from
five different sessions in parallel on different historical partition, at any
time they will lock 5000 rows and then commit.
If you are updating billions of rows in batches of 5000, that means you
are executing hundreds of thousands or millions of update statements.
Which in turn means that you want as little overhead as possible per
batch which means finding those 5000 rows should be quick. Which brings
us back to Igor's question: Do you have any indexes in place which speed
up finding those 5000 rows (the primary key almost certainly won't help
with that). EXPLAIN (ANALYZE) (as suggested by Laurenz) will certainly
help answering that question.
And also those rows will not collide with each other. So do you think
that approach can anyway cause locking issues?
No, I don't think so. With a batch size that small I wouldn't expect
problems even on the live partition. But of course many busy parallel
sessions will put additional load on the system which may or may not be
noticeable by users (you might saturate the disks writing WAL entries
for example, which would slow down other sessions trying to commit).
Regarding a batch update with a batch size of 1000, is there any method in
Postgres (like the FORALL statement in Oracle) that does batch DML? Can you
please guide me on how we can do it in Postgres.
Postgres offers several server side languages. As an Oracle admin you
will probably find PL/pgSQL most familiar. But you could also use Perl
or Python or several others. And of course you could use any
programming/scripting language you like on the client side.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
On Fri, Sep 13, 2024 at 11:59 PM yudhi s <learnerdatabase99@gmail.com>
wrote:
Do you have any indexes?
If not - you should, if yes - what are they?
Yes, we have a primary key on this table, which is on a UUID type column,
and we also have indexes on other timestamp columns. But how is this going
to help, as we are going to update almost all the rows in the table?
But do you have an index on tab_part1.column1?
And how slow is the subquery?
On Sat, Sep 14, 2024 at 4:55 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
When you said "(the primary key almost certainly won't help with that)", I
am trying to understand why that is so.
I was thinking of using that column as an incrementing filter and driving
the eligible rows based on that filter. If it had been a sequence, I think
it would have helped, but in this case it's a UUID, so I may not be able to
do the batch DML using that as the filter criteria. But in that case, will
it be fine to drive the update based on ctid, something as below? Each
session will have a range of five days of data (five partitions) and will
execute a query like the one below, which updates in batches of 10K and
then commits. Is this fine? Or is there some better way of doing batch DML
in Postgres PL/pgSQL?
DO $$
DECLARE
  l_rowid_array ctid[];
  l_ctid        ctid;
  l_array_size  INT := 10000;
  l_processed   INT := 0;
BEGIN
  FOR l_cnt IN 0..(SELECT COUNT(*) FROM part_tab
                   WHERE part_date > '1-sep-2024' AND part_date < '5-sep-2024'
                  ) / l_array_size LOOP
    l_rowid_array := ARRAY(
      SELECT ctid
      FROM part_tab
      WHERE part_date > '1-sep-2024' AND part_date < '5-sep-2024'
      LIMIT l_array_size OFFSET l_cnt * l_array_size
    );
    FOREACH l_ctid IN ARRAY l_rowid_array LOOP
      UPDATE part_tab
      SET column1 = reftab.code
      FROM reference_tab reftab
      WHERE part_tab.column1 = reftab.column1
        AND part_tab.ctid = l_ctid;
      l_processed := l_processed + 1;
    END LOOP;
    COMMIT;
  END LOOP;
END $$;
On Sat, 2024-09-14 at 16:10 +0530, yudhi s wrote:
We have mostly insert/update happen on current day/live partition. So
considering that, if we will run batch updates (with batch size of 1000) from
five different sessions in parallel on different historical partitions, at any
time they will lock 5000 rows and then commit. And also those rows will not
collide with each other. So do you think that approach can anyway cause locking
issues?
The updates won't lock with each other. I thought that other database activity
might modify rows in these partitions. If that is not the case, you don't need
to worry about locks.
In that case I would also choose a much higher batch size.
You should make sure to back off every now and then and VACUUM the partition,
so that you avoid excessive table bloat.
We will ensure the update of live partition occurs when we have least activity.
So in that way we will not need extended down time. Please correct me if wrong.
That sounds right.
Never used lock_timeout though, but in above case do we need lock_timeout?
It can be useful if your updating process is blocked by a lock from the
application. Setting the parameter to a low value will keep your update
from hanging for a long time and will throw an error instead.
Erroring out early reduces the danger of a deadlock.
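A minimal sketch of how a batch session might use it (the timeout value is
illustrative):

```sql
-- Illustrative: fail fast instead of queuing behind application locks.
SET lock_timeout = '100ms';

-- Run one batch; if it aborts with a lock_timeout error,
-- roll back and retry that batch later.
```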
Regarding a batch update with a batch size of 1000, is there any method in
Postgres (like the FORALL statement in Oracle) that does batch DML? Can you
please guide me on how we can do it in Postgres.
I would certainly not perform the update row for row in PL/pgSQL code.
Perhaps something like this:
DO
$$DECLARE
  i bigint;
BEGIN
  FOR i IN 1..1000000 BY 100000 LOOP
    UPDATE tab SET ...
    WHERE id >= i AND id < i + 100000;
    COMMIT;
  END LOOP;
END;$$;
VACUUM tab;
Then repeat for the next million rows, and so on.
And yes will need to see what happens in the update using explain analyze.
And I was trying to see, if we can run explain analyze without doing
actual update , but seems that is not possible.
You can do it in a transaction and roll the transaction back.
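For example, something along these lines (note that EXPLAIN (ANALYZE)
really executes the statement, so the ROLLBACK is what undoes its effects):

```sql
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE tab_part1
SET column1 = reftab.code
FROM reference_tab reftab
WHERE tab_part1.column1 = reftab.column1;
ROLLBACK;  -- discard the changes the ANALYZE run made
```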
Yours,
Laurenz Albe
The only way I see it as plausible to use a subquery, both in the query and
in the setting of the value, is if the relationship is one to one and there
is an index that supports the predicate:
UPDATE table1 t1
SET column_value = (SELECT <value> FROM table2 t2
                    WHERE t2.column_relation = t1.column_relation)
WHERE (column_relation) IN (SELECT column_relation FROM table2)
PS: the index being the one on table2.
Regards,
JRBM
On Sat, 14 Sept 2024 at 0:22, yudhi s (<learnerdatabase99@gmail.com>)
wrote:
On 2024-09-14 20:26:32 +0530, yudhi s wrote:
When you said "(the primary key almost certainly won't help with that)", I am
trying to understand why it is so?
I was thinking that you would do something like
begin;
with a as (
    select id from the_table
    where :part_lower <= id and id < :part_upper and col_x = :old
    limit 5000
)
update the_table set col_x = :new
from a where the_table.id = a.id;
commit;
in a loop until you update 0 rows and then switch to the next
partition. That pretty much requires an index on col_x or you will need
a sequential scan to find the next 5000 rows to update.
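A hedged sketch of such an index, reusing the placeholder names from the
example above:

```sql
-- Illustrative: lets each batch locate its next 5000 rows by col_x
-- without a sequential scan. CONCURRENTLY avoids blocking writes
-- while the index is built.
CREATE INDEX CONCURRENTLY the_table_col_x_idx ON the_table (col_x);
```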
Even if you return the ids and feed the last updated id back into the
loop like this:
with a as (
    select id from the_table
    where id > :n and col_x = :old
    order by id
    limit 5000
)
update the_table set col_x = :new
from a where the_table.id = a.id;
that may lead to a lot of extra reads from the heap or the optimizer
might even decide it's better to go for a sequential scan.
The latter is pretty unlikely if you restrict the range of ids:
update the_table set col_x = :new
where :n <= id and id < :n + 5000 and col_x = :old;
but that will possibly result in a lot of queries which don't update
anything at all but still need to read 5000 rows each.
I was thinking of using that column as an incrementing filter and driving the
eligible rows based on that filter. If it had been a sequence, I think it
would have helped, but in this case it's a UUID, so I may not be able to
do the batch DML using that as the filter criteria.
You can order by uuid or compare them to other uuids. So my first two
approaches above would still work.
but in that case will it be fine to drive the update based on ctid
something as below? Each session will have the range of 5 days of data
or five partition data and will execute a query something as below
which will update in the batches of 10K and then commit. Is this fine?
Or is there some better way of doing the batch DML in postgres
plpgsql?
DO $$
DECLARE
l_rowid_array ctid[];
l_ctid ctid;
l_array_size INT := 10000;
l_processed INT := 0;
BEGIN
FOR l_cnt IN 0..(SELECT COUNT(*) FROM part_tab WHERE part_date >
'1-sep-2024' and part_date < '5-sep-2024'
) / l_array_size LOOP
l_rowid_array := ARRAY(
SELECT ctid
FROM part_tab
WHERE part_date > '1-sep-2024' and part_date < '5-sep-2024'
LIMIT l_array_size OFFSET l_cnt * l_array_size
Never use LIMIT and OFFSET without an ORDER BY, especially not when you
are updating the table. You may get some rows twice and some never.
Also, OFFSET means you are reading all those rows and then ignoring
them. I expect this to be O(n²).
);
FOREACH l_ctid IN ARRAY l_rowid_array LOOP
update part_tab
SET column1 = reftab.code
FROM reference_tab reftab
I see you are using a reference table and I think you mentioned that you
will be updating most rows. So that alleviates my concerns that you may
read lots of rows without updating them. But you still need an efficient
way to get at the next rows to update.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Hi,
You can solve this problem using Citus in PostgreSQL, which is specifically
designed for this kind of parallelism:
SELECT create_distributed_table('tab_part1', 'partition_key');
SELECT create_distributed_table('reference_tab', 'reference_key');
UPDATE tab_part1
SET column1 = reftab.code
FROM reference_tab reftab
WHERE tab_part1.column1 = reftab.column1;
On Sat, 14 Sept 2024 at 08:22, yudhi s <learnerdatabase99@gmail.com> wrote:
On 2024-Sep-14, yudhi s wrote:
Hello,
We have to update a column value(from numbers like '123' to codes like
'abc' by looking into a reference table data) in a partitioned table with
billions of rows in it, with each partition having 100's millions rows.
Another option is to not update anything, and instead create a view on
top of the partitioned table (joined to the reference table) that
returns the reference value instead of the original number value from
the column; when the application wants to receive those reference
values, it queries the view instead of the partitioned table directly.
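A minimal sketch of that idea, reusing the placeholder names from the
earlier example query (a real view would list the table's actual columns):

```sql
-- Illustrative: expose the reference code alongside (or instead of)
-- the stored number, with no rewrite of the billions of rows.
CREATE VIEW tab_part1_v AS
SELECT t.*,                 -- or an explicit column list omitting column1
       r.code AS column1_code
FROM tab_part1 t
LEFT JOIN reference_tab r ON r.column1 = t.column1;
```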
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)