Thoughts on how to avoid a massive integer update.
Hi all,
This is a shot in the dark in hopes of finding a magic bullet to fix an issue I have; I can't personally think of any solution myself.
I have a database with hundreds of terabytes of data, where every table has an integer column referencing a small table. For reasons that are out of my control and that I cannot change, I NEED to update every single row in all these tables, changing the integer value to a different integer.
Since I have to deal with dead space, I can only do a couple tables at a time, then do a vacuum full after each one.
Another option is to build a new table with the new values, then drop the old one and swap in the new; either way is very time-consuming.
Initial tests suggest this effort will take several months to complete, not to mention cause blocking issues on tables being worked on.
Does anyone have any hackery ideas on how to achieve this in less time? I was looking at possibly converting the integer column type to another that would present the integer differently, like a hex value, but everything still ends up requiring all data to be re-written to disk. In a well designed database (I didn’t design it :) ), I would simply change the data in the referenced table (200 total rows), however the key being referenced isn’t just an arbitrary ID, it’s actual ‘data’, and must be changed.
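For concreteness, here is a minimal sketch of the rebuild-and-swap option (all names are hypothetical, and a 200-row mapping table old_to_new(old_id, new_id) is assumed; note that LIKE does not copy foreign keys or grants, so those would need re-creating):

    -- build the replacement with the remapped integers already in place
    CREATE TABLE data_table_new (LIKE data_table INCLUDING ALL);

    INSERT INTO data_table_new (data_table_sid, info_table_id)
    SELECT d.data_table_sid, m.new_id
    FROM   data_table d
    JOIN   old_to_new m ON m.old_id = d.info_table_id;

    -- swap: a brief exclusive lock, and no dead space left behind
    BEGIN;
    DROP TABLE data_table;
    ALTER TABLE data_table_new RENAME TO data_table;
    COMMIT;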
Thanks for any thoughts or ideas,
* Brian F
On 5/4/20 2:32 PM, Fehrle, Brian wrote:
Hi all,

This is a shot in the dark in hopes of finding a magic bullet to fix an
issue I have; I can't personally think of any solution myself.

I have a database with hundreds of terabytes of data, where every table
has an integer column referencing a small table. For reasons that are out of
my control and that I cannot change, I NEED to update every single row in all
these tables, changing the integer value to a different integer.

Since I have to deal with dead space, I can only do a couple tables at a
time, then do a vacuum full after each one.
Why?
A regular vacuum would mark the space as available.
More below.
Another option is to build a new table with the new values, then drop
the old one and swap in the new; either way is very time-consuming.

Initial tests suggest this effort will take several months to complete,
not to mention cause blocking issues on tables being worked on.

Does anyone have any hackery ideas on how to achieve this in less time?
I was looking at possibly converting the integer column type to another
that would present the integer differently, like a hex value, but
everything still ends up requiring all data to be re-written to disk. In
a well designed database (I didn't design it :) ), I would simply change
the data in the referenced table (200 total rows), however the key being
referenced isn't just an arbitrary ID, it's actual 'data', and must be
changed.
I'm not following above.
Could you show an example table relationship?
Thanks for any thoughts or ideas,
* Brian F
--
Adrian Klaver
adrian.klaver@aklaver.com
On 5/4/20 3:32 PM, Fehrle, Brian wrote:
Hi all,

This is a shot in the dark in hopes of finding a magic bullet to fix an
issue I have; I can't personally think of any solution myself.

I have a database with hundreds of terabytes of data, where every
table has an integer column referencing a small table. For reasons that are
out of my control and that I cannot change, I NEED to update every single row
in all these tables, changing the integer value to a different integer.

Since I have to deal with dead space, I can only do a couple tables at
a time, then do a vacuum full after each one.

Another option is to build a new table with the new values, then drop
the old one and swap in the new; either way is very time-consuming.

Initial tests suggest this effort will take several months to
complete, not to mention cause blocking issues on tables being worked on.

Does anyone have any hackery ideas on how to achieve this in less
time? I was looking at possibly converting the integer column type to
another that would present the integer differently, like a hex value,
but everything still ends up requiring all data to be re-written to
disk. In a well designed database (I didn't design it :) ), I would
simply change the data in the referenced table (200 total rows),
however the key being referenced isn't just an arbitrary ID, it's
actual 'data', and must be changed.

Thanks for any thoughts or ideas,
* Brian F
Is the new value computable from the original value with a single
function? Could you cover your tables with views transforming the column
with said function?
No? You'll need to divorce each table from that lookup table and any
foreign key relationships to avoid all those lookups with every update.
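A sketch of that view idea, assuming the new value really is derivable from the old one by a pure function (the function body here is a stand-in, and all names are hypothetical):

    CREATE FUNCTION remap_id(i integer) RETURNS integer
        IMMUTABLE LANGUAGE sql
        AS $$ SELECT i + 1000 $$;   -- stand-in for the real mapping

    CREATE VIEW data_table_v AS
    SELECT data_table_sid,
           remap_id(info_table_id) AS info_table_id
    FROM   data_table;

Readers query the view and see the new values; nothing on disk is rewritten.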
On 5/4/20 3:32 PM, Fehrle, Brian wrote:
Hi all,

This is a shot in the dark in hopes of finding a magic bullet to fix an
issue I have; I can't personally think of any solution myself.

I have a database with hundreds of terabytes of data, where every
table has an integer column referencing a small table. For reasons that are
out of my control and that I cannot change, I NEED to update every single row
in all these tables, changing the integer value to a different integer.

Since I have to deal with dead space, I can only do a couple tables at
a time, then do a vacuum full after each one.

Another option is to build a new table with the new values, then drop
the old one and swap in the new; either way is very time-consuming.

Initial tests suggest this effort will take several months to
complete, not to mention cause blocking issues on tables being worked on.

Does anyone have any hackery ideas on how to achieve this in less
time? I was looking at possibly converting the integer column type to
another that would present the integer differently, like a hex value,
but everything still ends up requiring all data to be re-written to
disk. In a well designed database (I didn't design it :) ), I would
simply change the data in the referenced table (200 total rows),
however the key being referenced isn't just an arbitrary ID, it's
actual 'data', and must be changed.

Thanks for any thoughts or ideas,
* Brian F
Here's my solution: add a column to the small table, filled with a
duplicate of the original column. Change the foreign keys of the huge
tables to reference the new column in the small table. Then update the
original values to what you now need them to be. (Not sure how much
re-index cost you'll see here, but non-zero I'm guessing.)
Your reports now must of course return the new value via joins to the
dictionary(?) table. Views may be your best bet here.
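For example, once data_table's foreign key points at the duplicated column (called orig_id below, hypothetically), a reporting view could surface the updated value without touching the huge table:

    CREATE VIEW data_table_report AS
    SELECT d.data_table_sid,
           i.info_table_sid AS new_value    -- the edited 'data' value
    FROM   data_table d
    JOIN   info_table i ON i.orig_id = d.info_table_id;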
On 2020-05-04 21:32:56 +0000, Fehrle, Brian wrote:
I have a database with hundreds of terabytes of data, where every table has an
integer column referencing a small table. For reasons that are out of my control
and that I cannot change, I NEED to update every single row in all these tables,
changing the integer value to a different integer.

Since I have to deal with dead space, I can only do a couple tables at a time,
then do a vacuum full after each one.
Do you have to change all the values at the same time? If you can change
them one (or a few) at a time, the tables shouldn't bloat much.
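Something along these lines, for instance (batch bounds and all names are hypothetical; a mapping table old_to_new(old_id, new_id) is assumed):

    -- one slice of the table at a time, by primary-key range
    UPDATE data_table d
    SET    info_table_id = m.new_id
    FROM   old_to_new m
    WHERE  m.old_id = d.info_table_id
    AND    d.data_table_sid BETWEEN 1 AND 1000000;

    VACUUM data_table;  -- dead tuples from this batch become reusable
    -- the next batch's new row versions land in the space just freed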
hp
--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
On 5/4/20, 3:56 PM, "Adrian Klaver" <adrian.klaver@aklaver.com> wrote:
On 5/4/20 2:32 PM, Fehrle, Brian wrote:
Hi all,

This is a shot in the dark in hopes of finding a magic bullet to fix an
issue I have; I can't personally think of any solution myself.

I have a database with hundreds of terabytes of data, where every table
has an integer column referencing a small table. For reasons that are out of
my control and that I cannot change, I NEED to update every single row in all
these tables, changing the integer value to a different integer.

Since I have to deal with dead space, I can only do a couple tables at a
time, then do a vacuum full after each one.
Why?
A regular vacuum would mark the space as available.
A regular vacuum would mark the space as available ***for re-use***; it will not release the space back to the drive as 'unused'. 99% of my tables hold old data that will not receive any future inserts or updates, which means that space marked ready for 're-use' will never actually be used.
This means that a 100GB table will be updated, and every row marked as dead and re-created with the newly updated data. This table is now 200GB in size. A vacuum will keep it at 200GB of space used, freeing up the 100GB of dead space as ready-to-reuse. A vacuum full will make it 100GB again.
Since 99% of my tables will never be updated or inserted into again, this means my ~300 terabytes of data would become ~600 terabytes on disk. Thus, vacuum full.
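The effect is easy to watch while testing; for example (table name hypothetical):

    SELECT pg_size_pretty(pg_total_relation_size('data_table'));  -- say, 100 GB
    UPDATE data_table SET info_table_id = info_table_id;          -- rewrites every row
    SELECT pg_size_pretty(pg_total_relation_size('data_table'));  -- now ~200 GB
    VACUUM FULL data_table;                                       -- back to ~100 GB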
More below.
Another option is to build a new table with the new values, then drop
the old one and swap in the new; either way is very time-consuming.

Initial tests suggest this effort will take several months to complete,
not to mention cause blocking issues on tables being worked on.

Does anyone have any hackery ideas on how to achieve this in less time?
I was looking at possibly converting the integer column type to another
that would present the integer differently, like a hex value, but
everything still ends up requiring all data to be re-written to disk. In
a well designed database (I didn't design it :) ), I would simply change
the data in the referenced table (200 total rows), however the key being
referenced isn't just an arbitrary ID, it's actual 'data', and must be
changed.
I'm not following above.
Could you show an example table relationship?
It’s a simple one-to-many relationship:
Info_table
    info_table_sid integer

data_table
    data_table_sid integer,
    info_table_id integer references info_table(info_table_sid)
Thanks for any thoughts or ideas,
* Brian F
--
Adrian Klaver
adrian.klaver@aklaver.com
Could you show an example table relationship?

It's a simple one-to-many relationship:

Info_table
    info_table_sid integer

data_table
    data_table_sid integer,
    info_table_id integer references info_table(info_table_sid)

Right, and now you wish to change the values in the referenced table
(info_table.info_table_sid), correct?
On Mon, May 4, 2020 at 2:33 PM Fehrle, Brian <bfehrle@comscore.com> wrote:
I NEED to update every single row in all these tables, changing the
integer value to a different integer.
Does anyone have any hackery ideas on how to achieve this in less time?
Probably the only solution that would perform computationally faster would
take the same amount of time or more to code and debug, and be considerably
riskier. Basically shut down PostgreSQL and modify the data files directly
to change one integer byte sequence to another. On the positive side the
source code for PostgreSQL is open source and that data, while complex, is
structured.
On the point of "vacuum" versus "vacuum full" - I don't know if this is how
it would work in reality but conceptually if you updated half the table,
vacuumed, updated the second half, vacuumed, the second batch of updates
would reuse the space freed from the first batch and you'd only increase
the disk consumption by 1.5 instead of 2.0. As you increase the number of
batches the percentage of additional space consumed decreases. Though if
you have the space I'd have to imagine that creating a brand new table and
dropping the old one would be the best solution when taken in isolation.
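In general, with n equal batches and a plain vacuum between them, the peak footprint is roughly (1 + 1/n) times the original size: each batch adds 1/n of the table as dead space, and the vacuum hands that space back for the next batch to fill.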
David J.
From: "David G. Johnston" <david.g.johnston@gmail.com>
Date: Friday, May 8, 2020 at 11:48 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
On Mon, May 4, 2020 at 2:33 PM Fehrle, Brian <bfehrle@comscore.com> wrote:
I NEED to update every single row in all these tables, changing the integer value to a different integer.
Does anyone have any hackery ideas on how to achieve this in less time?
Probably the only solution that would perform computationally faster would take the same amount of time or more to code and debug, and be considerably riskier. Basically shut down PostgreSQL and modify the data files directly to change one integer byte sequence to another. On the positive side the source code for PostgreSQL is open source and that data, while complex, is structured.
On the point of "vacuum" versus "vacuum full" - I don't know if this is how it would work in reality but conceptually if you updated half the table, vacuumed, updated the second half, vacuumed, the second batch of updates would reuse the spaced freed from the first batch and you'd only increase the disk consumption by 1.5 instead of 2.0. As you increase the number of batches the percentage of additional space consumed decreases. Though if you have the space I'd have to imagine that creating a brand new table and dropping the old one would be the best solution when taken in isolation.
David J.
Modifying data files is too risky and I wouldn’t be able to get that kind of work approved.
Even with keeping excess space to an additional 50%, that's tons of storage I'd need to order, so either vacuum full or re-creating tables for minimal on-disk usage are my only options.
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 11:05 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
Could you show an example table relationship?
It’s a simple one-to-many relationship:
Info_table
    info_table_sid integer

data_table
    data_table_sid integer,
    info_table_id integer references info_table(info_table_sid)

Right, and now you wish to change the values in the referenced table
(info_table.info_table_sid), correct?
Correct. If info_table.info_table_sid = 123 and many rows in data_table point to it, the ID needs to be changed to 456 in info_table, as well as in all the rows of data_table that reference it.
Well as I said, I think you could add a column to info_table
alter table info_table add orig_id int;
update info_table set orig_id = info_table_sid;
update info_table set info_table_sid = 456 where info_table_sid = 456;
alter table data_table drop reference NOT SQL
alter table data_table make reference to info_table.orig_id NOT SQL
you have to do the second block per orig_id
and the third block per table
And your reports needing the new value will of course need to do the join and get the updated value, which now resides in the ill-named info_table_sid.
This leaves “orig_id” as just an id and “info_table_sid” as an editable attribute
Nothing is broken other than the per-table lock while you switch the foreign key
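Spelled out in real SQL, that scheme might look roughly like this (constraint names are hypothetical, the referenced column needs a unique constraint before a foreign key can point at it, and the DROP/ADD pair is repeated per huge table):

    ALTER TABLE info_table ADD COLUMN orig_id integer;
    UPDATE info_table SET orig_id = info_table_sid;               -- ~200 rows
    ALTER TABLE info_table
        ADD CONSTRAINT info_table_orig_id_key UNIQUE (orig_id);

    ALTER TABLE data_table DROP CONSTRAINT data_table_info_table_id_fkey;
    ALTER TABLE data_table
        ADD CONSTRAINT data_table_info_table_id_fkey
        FOREIGN KEY (info_table_id) REFERENCES info_table (orig_id) NOT VALID;

    -- the 'data' value can now be edited in the 200-row table alone
    UPDATE info_table SET info_table_sid = 456 WHERE orig_id = 123;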
On May 8, 2020, at 1:36 PM, Fehrle, Brian <bfehrle@comscore.com> wrote:
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 11:05 AM
To: "Fehrle, Brian" <bfehrle@comscore.com>
Cc: Adrian Klaver <adrian.klaver@aklaver.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
Could you show an example table relationship?

It's a simple one-to-many relationship:

Info_table
    info_table_sid integer

data_table
    data_table_sid integer,
    info_table_id integer references info_table(info_table_sid)

Right, and now you wish to change the values in the referenced table (info_table.info_table_sid), correct?

Correct. If info_table.info_table_sid = 123 and many rows in data_table point to it, the ID needs to be changed to 456 in info_table, as well as in all the rows of data_table that reference it.
On Fri, May 8, 2020 at 12:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
Well as I said, I think you could add a column to info_table
alter table info_table add orig_id int;
update info_table set orig_id = info_table_sid;
update info_table set info_table_sid = 456 where info_table_sid = 456;
huh?
alter table data_table *drop reference NOT SQL*
alter table data_table *make reference to info_table.orig_id NOT SQL*
You don't seem to understand the requirement. The data_table integer value
must be changed - all you are doing is a convoluted column rename on the
table holding the PK half of the relationship.
David J.
On May 8, 2020, at 2:37 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, May 8, 2020 at 12:49 PM Rob Sargent <robjsargent@gmail.com> wrote:
Well as I said, I think you could add a column to info_table
alter table info_table add orig_id int;
update info_table set orig_id = info_table_sid;
update info_table set info_table_sid = 456 where info_table_sid = 456;
huh?
alter table data_table drop reference NOT SQL
alter table data_table make reference to info_table.orig_id NOT SQL

You don't seem to understand the requirement. The data_table integer value must be changed - all you are doing is a convoluted column rename on the table holding the PK half of the relationship.
David J.
My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?
On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:
My understanding is the keys in the info_table need to change. That
causes the very expensive update in the data tables. No?
The keys in the info_table need to change because their contents are no
longer legal to be stored (the OP has not specified, but think of using an
integer value of someone's social security number as a key). The FK side of
the relationship equality has the same illegal data values problem and needs
to be changed too.
David J.
On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:

My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.
David J.
Wow, I couldn’t disagree more ;)
Let's say it were an SSN. Aside from the fact that those are mutable and should not be used as an id (not to say identifier), the SSN should only be in the table aligned with that_person. Other tables referring to that person should use the locally assigned and arbitrary identifier for that person. (Frankly there is no "natural key" for a person.)
The scheme I propose has a chance of completing relatively unobtrusively and quickly, allowing a migration over time if deemed necessary. Likely to be finished just in time for the next interruption.
On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:
On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:

My understanding is the keys in the info_table need to change. That
causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no
longer legal to be stored (the OP has not specified, but think of using an
integer value of someone's social security number as a key). The FK side of
the relationship equality has the same illegal data values problem and needs
to be changed too.

Wow, I couldn't disagree more ;)
Your agreement or disagreement with the problem statement is immaterial
here - the OP has stated the requirement, for which I have made a
simplistic analogy in order to try and get the point across to you. As the
OP has said it is a poor design - and now it is being corrected. The
request is whether there is some way to do so better than the two options
the OP already described.
David J.
On May 8, 2020, at 2:57 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:

On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:

My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.
Wow, I couldn’t disagree more ;)
Your agreement or disagreement with the problem statement is immaterial here - the OP has stated the requirement, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.
David J.
Sorry, I wasn't disagreeing with the problem statement. OP did say the "info.id" needed to change from 123 to 456. With the current foreign key alignment that is very expensive. I think we're all in agreement there. To push "456" back out to the data table I see as perpetuation of the problem. I didn't sense that OP felt it necessary to continue in the current mode as a requirement. If so, my mistake.
David’s assessment is correct (and I think we’re all on the same page). The value of the foreign keys that tie the tables together must be changed, and yeah that value _should_ simply be an additional column in the info_table and the foreign key be an arbitrary integer, but since it wasn’t set up that way from the beginning (over a decade ago), this is what I’m stuck with.
Blah.
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 3:05 PM
To: "David G. Johnston" <david.g.johnston@gmail.com>
Cc: "Fehrle, Brian" <bfehrle@comscore.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.
On May 8, 2020, at 2:57 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:

On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:

My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.

Wow, I couldn't disagree more ;)

Your agreement or disagreement with the problem statement is immaterial here - the OP has stated the requirement, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.
David J.
Sorry, I wasn't disagreeing with the problem statement. OP did say the "info.id" needed to change from 123 to 456. With the current foreign key alignment that is very expensive. I think we're all in agreement there. To push "456" back out to the data table I see as perpetuation of the problem. I didn't sense that OP felt it necessary to continue in the current mode as a requirement. If so, my mistake.
On May 8, 2020, at 3:52 PM, Fehrle, Brian <bfehrle@comscore.com> wrote:
David’s assessment is correct (and I think we’re all on the same page). The value of the foreign keys that tie the tables together must be changed, and yeah that value _should_ simply be an additional column in the info_table and the foreign key be an arbitrary integer, but since it wasn’t set up that way from the beginning (over a decade ago), this is what I’m stuck with.
Blah.
Uncle.
And you're stuck with it because no one joins back to info, I take it? Denormalization writ large. Oh heck, do the switcheroo and mail out a magic decoder ring ;)
From: Rob Sargent <robjsargent@gmail.com>
Date: Friday, May 8, 2020 at 3:05 PM
To: "David G. Johnston" <david.g.johnston@gmail.com>
Cc: "Fehrle, Brian" <bfehrle@comscore.com>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Thoughts on how to avoid a massive integer update.

On May 8, 2020, at 2:57 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:51 PM Rob Sargent <robjsargent@gmail.com> wrote:

On May 8, 2020, at 2:43 PM, David G. Johnston <david.g.johnston@gmail.com> wrote:

On Fri, May 8, 2020 at 1:41 PM Rob Sargent <robjsargent@gmail.com> wrote:

My understanding is the keys in the info_table need to change. That causes the very expensive update in the data tables. No?

The keys in the info_table need to change because their contents are no longer legal to be stored (the OP has not specified, but think of using an integer value of someone's social security number as a key). The FK side of the relationship equality has the same illegal data values problem and needs to be changed too.

Wow, I couldn't disagree more ;)

Your agreement or disagreement with the problem statement is immaterial here - the OP has stated the requirement, for which I have made a simplistic analogy in order to try and get the point across to you. As the OP has said it is a poor design - and now it is being corrected. The request is whether there is some way to do so better than the two options the OP already described.

David J.

Sorry, I wasn't disagreeing with the problem statement. OP did say the "info.id" needed to change from 123 to 456. With the current foreign key alignment that is very expensive. I think we're all in agreement there. To push "456" back out to the data table I see as perpetuation of the problem. I didn't sense that OP felt it necessary to continue in the current mode as a requirement. If so, my mistake.