UPDATE many records
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straightforward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc.), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).
Given that each row update is completely independent of any other row, I have the following questions:
1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
There are several ways to actually do this.
If you have PostgreSQL 11 or higher, we now have CREATE PROCEDURE, which allows committing transactions; one drawback is that it cannot run in parallel from inside the procedure.
https://www.postgresql.org/docs/11/sql-createprocedure.html
https://severalnines.com/database-blog/overview-new-stored-procedures-postgresql-11
If it's an older version, then a Python script (or other scripting language) that iterates over the data in chunks of, say, 10 to 20K rows will do what you want:
for i in list of chunk-start IDs:
    begin;
    UPDATE table_name SET changed_field = new_value()
        WHERE int4range(i, i + 10000) @> ID;
    commit;
To create parallel processes, a simple Python script (or other scripting language) can be used to create many connections working the data in parallel, but given the simple update it will NOT help performance: this will be hard disk IO bound, not process bound, which is where parallelization helps.
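For the PostgreSQL 11 route, a minimal sketch of such a procedure might look like the following. The table_name/changed_field/new_value names come from the original post; the integer primary key "id", the batch arithmetic, and the procedure name are assumptions for illustration only.

CREATE PROCEDURE update_in_batches(batch_size int DEFAULT 10000)
LANGUAGE plpgsql
AS $$
DECLARE
    last_id bigint := 0;
    max_id  bigint;
BEGIN
    SELECT max(id) INTO max_id FROM table_name;
    WHILE last_id < max_id LOOP
        UPDATE table_name
           SET changed_field = new_value()
         WHERE id > last_id AND id <= last_id + batch_size;
        COMMIT;  -- allowed inside a procedure on PostgreSQL 11+
        last_id := last_id + batch_size;
    END LOOP;
END;
$$;

-- CALL update_in_batches();  -- must not be run inside an explicit transaction block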
On 1/6/20 10:36 AM, Israel Brewster wrote:
1) Is there any way to set the command such that each row change is committed as it is calculated?
Pretty sure:
UPDATE table_name SET changed_field=new_value();
is seen as a single statement and is all or none.
If you want to go row by row you will need to have the statement run on
a row by row basis or maybe in batches.
2) Is there some way to run this command in parallel in order to better
utilize multiple processor cores, other than manually breaking the data
into chunks and running a separate psql/update process for each chunk?
Honestly, manual parallelizing wouldn’t be too bad (there are a number
of logical segregations I can apply), I’m just wondering if there is a
more automatic option.
This is a good time to ask: what Postgres version?
I am still working out the recent parallel query system additions. Not sure if they apply to UPDATE or not.
--
Adrian Klaver
adrian.klaver@aklaver.com
Yeah, I'd be inclined to do this in batches.
If, for instance, the table has a nice primary key, then I'd capture the
primary keys into a side table, and grab tuples from the side table to
process in more bite-sized batches, say, of a few thousand tuples per batch.
create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...
create temp table iteration as select pk_column from just_keys where
processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column
from iteration)]
update iteration set processed='true' where pk_column in (select pk_column
from iteration);
drop table iteration;
Parallelization is absolutely an interesting idea; if you want to use 8
processes, then use a cycling sequence on the side table to spread tuples
across the 8 processes, so that they can grab their own tuples and not
block one another.
In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as
processed, nextval('seq_procs') as batch_id from big_historical_table;
The individual iterations then look for values in just_keys corresponding
to their assigned batch number.
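As an illustration of that last step, one worker session (say the one assigned batch_id = 3) might loop over something like the following. This is only a sketch built from the names above (just_keys, big_historical_table, pk_column) plus the table_name columns from the original post; it assumes the second definition of just_keys (where processed defaults to false), and that just_keys is created as a regular table rather than a temp table so that all worker sessions can see it.

-- worker session for batch_id = 3; repeat these four statements until iteration comes back empty
CREATE TEMP TABLE iteration AS
    SELECT pk_column
      FROM just_keys
     WHERE batch_id = 3
       AND NOT processed
     LIMIT 1000;

UPDATE big_historical_table
   SET changed_field = new_value()
 WHERE pk_column IN (SELECT pk_column FROM iteration);

UPDATE just_keys
   SET processed = true
 WHERE pk_column IN (SELECT pk_column FROM iteration);

DROP TABLE iteration;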
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
Good information. I did forget to mention that I am using PostgreSQL 11.5. I also was not aware of the distinction between PROCEDURE and FUNCTION, so I guess I used the wrong terminology there when stating that new_value is the result of a stored procedure. It’s actually a function.
So would your suggestion then be to create a procedure that loops through the records, calculating and committing each one (or, as in your older Postgres example, batches of 10k to 20k)?
Good point on the HD I/O bound vs processor bound, but wouldn’t that depend on how complicated the actual update is? Still, there is a good chance you are correct in that statement, so that aspect is probably not worth spending too much time on.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
As you have access to procedures, you can create a loop and issue an Update plus Commit per batch, so something like this should work in plpgsql (run inside a procedure so that COMMIT is allowed):

declare
    icount    int := 0;
    new_count int := 0;
begin
    -- assumes id values run densely from 0 up to roughly count(*)
    select count(*) into icount from mytable;
    loop
        update mytable
           set myvalue = newvalue()
         where id between new_count and new_count + 9999;
        commit;
        new_count := new_count + 10000;
        exit when new_count > icount;
    end loop;
end;
I am going to put a caveat on this: if the newvalue() function is complex and takes a lot of CPU cycles to do its thing, then parallelism would help. However, if this function looks at the table being updated, it can really complicate things, as the parallel functions would be looking at stale records, which could be bad...
I think you’re overcomplicating the matter.
I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.
If that fails[1], only then would I start looking into batching things. But then you still need to figure out why it fails and what to do about that; if it fails it will probably fail fast, and if not, then you’re looking at a one-off situation that won’t require more than a few workarounds - after which you can just run the update again.
Ad 1). No harm has been done, it’s a single transaction that rolled back.
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
As for parallelism, if you have a really powerful disk subsystem (lots of disks in a disk array), it may help. Or not, as commented.
I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
You won't find something terribly much more automatic.
Oh, yah, there's a possible further complication; does the application need
to get stopped to do this update? Is the newest version of the app still
generating data that needs the rewriting? Sure hope not...
On Jan 6, 2020, at 1:29 PM, Alban Hertroys <haramrae@gmail.com> wrote:
I think you’re overcomplicating the matter.
I’d just do it as a single update in one transaction. It’s only 50M rows. It may take half an hour or so on decent hardware, depending on how resource-intensive your function is.
I must emphasize: This estimate is HIGHLY dependent on hardware and the complexity of the table (number of indices, etc). (I suspect there’s a correlation between table size (business value) and number of indices)
On Jan 6, 2020, at 11:38 AM, Christopher Browne <cbbrowne@gmail.com> wrote:
On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrewster@alaska.edu <mailto:ijbrewster@alaska.edu>> wrote:
On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne@gmail.com <mailto:cbbrowne@gmail.com>> wrote:
On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster@alaska.edu <mailto:ijbrewster@alaska.edu>> wrote:
Thanks to a change in historical data, I have a need to update a large number of records (around 50 million). The update itself is straight forward, as I can just issue an "UPDATE table_name SET changed_field=new_value();" (yes, new_value is the result of a stored procedure, if that makes a difference) command via psql, and it should work. However, due to the large number of records this command will obviously take a while, and if anything goes wrong during the update (one bad value in row 45 million, lost connection, etc), all the work that has been done already will be lost due to the transactional nature of such commands (unless I am missing something).Given that each row update is completely independent of any other row, I have the following questions:
1) Is there any way to set the command such that each row change is committed as it is calculated?
2) Is there some way to run this command in parallel in order to better utilize multiple processor cores, other than manually breaking the data into chunks and running a separate psql/update process for each chunk? Honestly, manual parallelizing wouldn’t be too bad (there are a number of logical segregations I can apply), I’m just wondering if there is a more automatic option.Yeah, I'd be inclined to do this in batches.
If, for instance, the table has a nice primary key, then I'd capture the primary keys into a side table, and grab tuples from the side table to process in more bite-sized batches, say, of a few thousand tuples per batch.
create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
then loop repeatedly along the lines...create temp table iteration as select pk_column from just_keys where processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column from iteration)]
update iteration set processed='true' where pk_column in (select pk_column from iteration);
drop table iteration;Parallelization is absolutely an interesting idea; if you want to use 8 processes, then use a cycling sequence on the side table to spread tuples across the 8 processes, so that they can grab their own tuples and not block one another.
In that case, more like...
create temp sequence seq_procs start with 1 maxval 8 cycle;
create temp table just_keys as select pk_column, false::boolean as processed, nextval('seq_procs') as batch_id from big_historical_table;The individual iterations then look for values in just_keys corresponding to their assigned batch number.
Sounds like a reasonable approach. As Justin pointed out, it is actually likely that the process will be IO bound rather than CPU bound, so my parallel idea may not have much merit after all, but the batching procedure makes sense. I assume you meant update just_keys in your sample rather than update iteration on that line just before drop table iteration. Thanks for the info
As for parallelism, if you have really powerful disk, lots of disks on disk array, it may help. Or not, as commented.
I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
You won't find something terribly much more automatic.
Oh, yah, there's a possible further complication; does the application need to get stopped to do this update? Is the newest version of the app still generating data that needs the rewriting? Sure hope not…
Yeah, a valid concern, but I should be ok on that front. Once I fix the calculation function, any new records will have the correct value. Plus, the actual update calculation is idempotent, so if a handful of new records end up getting re-calculated, that’s not an issue. Granted, the data will look weird while the re-calculation is in process (part new, part old), but we can live with that :-)
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
On Jan 6, 2020, at 11:40 AM, Rob Sargent <robjsargent@gmail.com> wrote:
I must emphasize: This estimate is HIGHLY dependent on hardware and the complexity of the table (number of indices, etc). (I suspect there’s a correlation between table size (business value) and number of indices)
I’m thinking it might be worth it to do a “quick” test on 1,000 or so records (or whatever number can run in a minute or so), watching the processor utilization as it runs. That should give me a better feel for where the bottlenecks may be, and how long the entire update process would take. I’m assuming, of course, that the total time would scale more or less linearly with the number of records.
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
I think that depends on how you identify and limit the update to those 1000 records. If it is using a primary key with specific keys in an array, probably close to a linear increase, because the where clause isn't impactful to the overall execution time. If you write a sub-query that is slow, then you would need to exclude that from the time. You can always run explain analyze on the update and rollback rather than commit.
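In psql, that last suggestion amounts to something like the following; the WHERE clause shown here is only a placeholder for however the test sample of rows is actually selected, and pk_column is a stand-in for the real primary key.

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table_name
   SET changed_field = new_value()
 WHERE pk_column = ANY (ARRAY[101, 102, 103]);  -- hypothetical sample of primary keys
ROLLBACK;  -- EXPLAIN ANALYZE really executes the update, so roll it back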
So a test run on 9,299 records took about 7 seconds to complete (EXPLAIN ANALYZE output at https://explain.depesz.com/s/lIYn <https://explain.depesz.com/s/lIYn> if it matters), during which time I did see a postmaster process consuming 100% CPU. Upping the test to 20,819 records took about 16.5 seconds, so that looks relatively linear to me. Also, CPU bound. So by my calculations, doing all 50M records would take around 10 hours.
One potentially significant note: most of the execution time is spent in a trigger. This trigger is actually what’s doing the REAL update that I need to happen. If it would make a difference, I could easily pull the trigger code out to a separate function that I just call directly (with triggers temporarily disabled). My thinking is that calling a function is calling a function, and the fact that it is currently called via a trigger rather than direct is of little consequence, but I’m willing to be corrected on that :-)
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
Just out of curiosity, what kind of trigger are you using, a row level trigger or a statement level trigger? If you are using a row level trigger, see if you can achieve your requirements using a statement level trigger instead. I’m relatively new to Postgres, so there could be some limit that I’m not aware of, but my understanding is that you have access to the old and new values of the updated rows in the after statement trigger. It would likely be much more performant to do your operation once after the statement is done rather than firing a trigger on every changed row.
Regards,
Mark Z.
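For reference, an AFTER ... FOR EACH STATEMENT trigger with transition tables (available since PostgreSQL 10) looks roughly like the sketch below. The trigger and function names, the summary_table it writes to, and the group_id column are purely hypothetical, since the real trigger's logic isn't shown in the thread.

CREATE FUNCTION recalc_after_update() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- new_rows holds every row changed by the triggering UPDATE statement
    UPDATE summary_table s
       SET total = sub.sum_value
      FROM (SELECT group_id, sum(changed_field) AS sum_value
              FROM new_rows
             GROUP BY group_id) sub
     WHERE s.group_id = sub.group_id;
    RETURN NULL;  -- return value is ignored for statement-level triggers
END;
$$;

CREATE TRIGGER recalc_stmt
AFTER UPDATE ON table_name
REFERENCING NEW TABLE AS new_rows
FOR EACH STATEMENT
EXECUTE FUNCTION recalc_after_update();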
My experience/understanding is that statement level triggers can be a big
performance boost, but only for changing *other* tables and not the table
that the trigger is on since it is *AFTER* only and can't modify NEW record
directly.
What was the HD wait time? What tool is being used to monitor the server resources?
It appears, based on this information, that there is a lot more going on than a simple Update command.
Moving code out of the trigger is probably not going to improve performance, unless there is a lot of code that does not need to be processed for this update, or code touching other tables.
Study the trigger, identify what has to run, pull that code out, then disable the trigger. Move the necessary code to a new function for updating.
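The disable/update/re-enable pattern being described would look something like this; the trigger name and the extracted function are made up for illustration, and note that disabling the trigger affects every session writing to the table, not just this update.

-- hypothetical names; run while other writers are quiesced or can tolerate the disabled trigger
ALTER TABLE table_name DISABLE TRIGGER rotate_tilt_trigger;

UPDATE table_name
   SET changed_field = recalc_tilt();  -- hypothetical function holding the logic pulled out of the trigger

ALTER TABLE table_name ENABLE TRIGGER rotate_tilt_trigger;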
On Jan 6, 2020, at 12:49 PM, Justin <zzzzz.graf@gmail.com> wrote:
What was the HD wait time? What tool is being used to monitor the server resources?
No idea on the HD wait time - how would I best monitor that? That said, this machine does have NVMe drives, so the speed should be fairly high/wait time fairly low. It’s also running as a VM, which could affect things, but is a bit of a moot point as far as this update goes. As far as monitoring server resources, I was just using top.
It appears, based on this information, that there is a lot more going on than a simple Update command.
Depending on your definition of “simple update” of course, very true. As I stated in the original message, the actual update value is the result of a function. The psql command is a simple update, but the function does a bit of stuff (primarily trigonometry). According to the EXPLAIN ANALYZE, about .7 ms of stuff per record, which of course is most of the runtime. It is entirely possible that the function could be optimized to run more quickly.
Moving code out of the trigger is probably not going to improve performance, unless there is a lot of code that does not need to be processed for this update, or code touching other tables.
One SELECT query on another table to get some values I need to use for the calculation. No code that is not needed for the update. Given the nature of this bulk update, I *could* make a separate function that simply takes those values as parameters, since the same value will be applied to a lot of records. I’d just have to be careful about how I applied the update, so rows get processed with the correct values. I’m not convinced it would be worth it though - might shave a few hours off the total execution time (assuming that SELECT is expensive - EXPLAIN ANALYZE shows an index scan, on a table with only 12,761 rows, which seems to be about as simple as it gets), but I doubt it would be enough for me to feel comfortable simply running the update as one monolithic unit.
Study the trigger, identify what has to run, pull that code out, then disable the trigger. Move the necessary code to a new function for updating.
Sure. But I feel we are getting a bit off track. Optimizing the runtime of the update is great, but this is a one-off (hopefully) event. I want to accomplish it as quickly as possible, of course, but at the same time it doesn’t make sense to spend a lot of time optimizing every component of the query. The main purpose of the question was honestly for my sanity, to reduce the likelihood of having it run for several hours only to error out due to bad data or whatever and have to start over from the top. Running in parallel simply seemed to be a no-brainer option to make it go quicker, assuming CPU bound updating. Optimizations that are going to take work are probably not worth it. We can wait for the data to be updated.
Thanks again!
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell: 907-328-9145
It sounds like you're in a decent place on this, and that you have done a
pretty apropos amount of exploration of the matter.
I was pleased to hear that you have the idempotency of the updates well in
hand, and that the application can cope with the degree of out-of-sync that
things will temporarily be.
The estimate of 10h to update the data doesn't surprise me; that's long
enough that it sure seems tempting to do the work in pieces so that you
don't have your whole set of application data locked for 10h.
I'd be inclined to call this "enough attention" for a one-off event.
I'll poke at the trigger aspect a wee bit; if the trigger function does a
one-tuple-at-a-time handling of things, so that it fires 50M times, you
might get a substantial speedup by replacing that with an equivalent set
operation that processes a few thousand tuples at a time. That said, if
you're happy with the process running 10h, it's not worth unpeeling the
extra testing needed to ensure identical end states.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
The system monitor I like is glances; you have to have Python installed to run it, and it can run in a command console or present a web interface. It's a very nice way to get a quick, detailed view of what is going on.
https://nicolargo.github.io/glances/
Just monitoring the system CPU utilization does not give enough information to state whether the system is CPU bound, IO bound, or network IO bound.
On simple selects, assuming the data is in shared buffers, it's a trivial event; if the select has to go to disk to get the data, then it's not trivial, especially if it's a big table. One thing that can happen is that the update and the selects in the trigger push data in and out of shared memory, causing the disk to thrash, which explain does not report.
Explain does not peer into functions, triggers, etc. For "select myfunction(), column list from mytable" or "update mytable set column = myfunction()", it will not report what happened inside, just the total time it took to run.
Looking at the explain, the Update only took 0.6 seconds while the trigger took 6.5:
rotate_tilt_data: time=6560.401 calls=9299
I'm a little confused: is this trigger function going to be around after this update? If yes, it would be worth looking into.
If it's a one-time run, or once every 12 months, who cares; start the update on Friday night and go in on Saturday to check it.