Add column with default value in big table - splitting of updates can help?

Started by Durumdara · about 6 years ago · 6 messages · general
#1Durumdara
durumdara@gmail.com

Dear Members!

I've read this article, but I need your experience on this topic.

https://leopard.in.ua/2016/09/20/safe-and-unsafe-operations-postgresql#.XjL3fcqYXDc

alter table tk
add column field1 text default 'MUCH';

The table tk has 200 million rows. Autovacuum is not the problem, only the
long update.

But as I read, the alter takes a table lock, so this update locks the table
for a long time.

The article said we need to do this:

1. add column without default - fast.
2. set default on column.
3. update it where it is null.
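
The three steps, sketched as SQL (a sketch only; assuming field1 is a text
column, since the type is not given in the post):

```sql
-- 1. add column without default: only a catalog change, fast
alter table tk add column field1 text;

-- 2. set the default: applies only to future inserts, fast
alter table tk alter column field1 set default 'MUCH';

-- 3. backfill the existing rows: this is the slow part
update tk set field1 = 'MUCH' where field1 is null;
```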

What can we save with this method?

As I suppose, the table lock is substituted with a long update (row locks on
the whole table).

The article suggested replacing the long update with shorter batches
(10000-100000 records per cycle).

We used to execute these SQLs (alter, default, update) in one transaction,
so I can't commit there.

What is the difference between a "full update" and "updates by 10000 records"
when I'm in a single transaction?

Is it possible that this pseudo code takes fewer locks WITHOUT AN INNER COMMIT?

Pseudo:
----------------------
do $$
begin
  loop
    update tk set field1 = 'MUCH'
    where id in (select id from tk where field1 is null limit 100000);
    exit when not found;
  end loop;
end $$;
----------------------

Is there any advantage when I split the updates? I'm in one transaction.

TR. START
----
1. alter add col
2. set default
3. updates
---
TR. COMMIT

Or doesn't it help me?

Because the whole transaction locks out the other users too, just like the
"alter table ... add column ... default" statement?

Thank you for your help!

Best regards
dd

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Durumdara (#1)
Re: Add column with default value in big table - splitting of updates can help?

On 1/30/20 7:51 AM, Durumdara wrote:

What version of Postgres are you using?

I ask because:

https://www.postgresql.org/docs/11/release-11.html

"Many other useful performance improvements, including the ability to
avoid a table rewrite for ALTER TABLE ... ADD COLUMN with a non-null
column default"
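
With that PG 11 feature the one-statement form becomes cheap: the default is
stored once in the catalog and returned on the fly for pre-existing rows, so
no rewrite happens. A sketch (assuming field1 is text, as in the original post):

```sql
-- PG 11+: catalog-only change; the 200-million-row table is not rewritten
alter table tk add column field1 text default 'MUCH';
```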


--
Adrian Klaver
adrian.klaver@aklaver.com

#3hubert depesz lubaczewski
depesz@depesz.com
In reply to: Durumdara (#1)
Re: Add column with default value in big table - splitting of updates can help?

On Thu, Jan 30, 2020 at 04:51:02PM +0100, Durumdara wrote:

Is there any advance when I split updates? I'm in one transaction.

In this case - no. The benefit of split updates comes when you're not in a
single transaction.

Why would you want to have it all done in single transaction?

Best regards,

depesz

#4Durumdara
durumdara@gmail.com
In reply to: hubert depesz lubaczewski (#3)
Re: Add column with default value in big table - splitting of updates can help?

Hello!

a.)
PG version is mainly 9.6, but some of the servers are 10.x or 11.x.

b.)
We have a semi-automatic tool which collects the new modifications for the
databases and executes them at once, per database.
So: one SQL script per database, in one transaction - all or nothing. If
something fails, we know where to start again by hand. This way it is
impossible to execute only the first half without knowing which statements
ran.

The main problem is that sometimes we have to modify tables which have
too many records in some customer databases.

---

I think the best way to solve this is:

1.)
Future goal: upgrade to PG 11.

2.)
We need to write a preprocessor. If it finds "alter table" with "add
column" and "default", and it is a "critical database", the whole operation
halts at that point and warns us to "do it yourself"... :-)

---

After the previous, similar problem on the tk table I tried to write a client
app which updates records in batches of 10000 with a commit - but it was very slow.

update tk set field1 = 'MUCH' where id in (
select id from tk where field1 is NULL limit 10000
)

I think the cause was that the target field has no index (because many
times the fields with default values are not searched, just row-level
fields), and the client/server round trips are slower than I'd like.

Formerly I thought I could speed this up with a stored proc - but as I
read, stored functions can't use inner transactions - so I must make a
client programme use begin/commit... (PgAdmin?).

Thanks for the prior infos!

Best regards
dd


#5Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Durumdara (#4)
Re: Add column with default value in big table - splitting of updates can help?

On 1/31/20 5:43 AM, Durumdara wrote:


Formerly I thought I could speed this up with a stored proc - but as I
read, stored functions can't use inner transactions - so I must make a
client programme use begin/commit... (PgAdmin?).

Before version 11 Postgres only had stored functions. With 11+ there are
stored procedures and transaction control:

https://www.postgresql.org/docs/11/plpgsql-transactions.html

In stored functions the best you can do is:

https://www.postgresql.org/docs/11/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING
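
On 11+ the batched backfill can then live server-side. A minimal sketch,
assuming the tk/field1 names from this thread and an illustrative batch size;
COMMIT is allowed in a procedure (not in a function) when it is CALLed outside
an explicit transaction block:

```sql
create procedure backfill_field1()
language plpgsql
as $$
begin
  loop
    update tk set field1 = 'MUCH'
    where id in (select id from tk where field1 is null limit 10000);
    exit when not found;
    commit;  -- ends the batch's transaction, releasing its row locks
  end loop;
end $$;

call backfill_field1();
```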


--
Adrian Klaver
adrian.klaver@aklaver.com

#6Kevin Brannen
KBrannen@efji.com
In reply to: Durumdara (#4)
RE: Add column with default value in big table - splitting of updates can help?

From: Durumdara <durumdara@gmail.com>

a.)
PG version is mainly 9.6, but some of the servers are 10.x or 11.x.
b.)
We have a semi-automatic tool which collects the new modifications for the databases and executes them at once, per database.
So: one SQL script per database, in one transaction - all or nothing. If something fails, we know where to start again by hand. This way it is impossible to execute only the first half without knowing which statements ran.

Unless you have some special requirement, you don't have to do it all-or-nothing.
As depesz points out, you can do it in multiple transactions just
fine. We do it that way here all the time. :)

What it means is that you have to write guards or checks into your upgrade script.
In some instances it's very easy, because some statements have IF NOT EXISTS to
help you. For those places where running the same statement twice would cause an
error, put a check around it. A plpgsql DO block allows for conditionals, so you
do the work only when the check shows it hasn't been done yet (the
information_schema and pg_catalog tables are your friends for this).

The point is that you can run your upgrade script as many times as needed, should
something happen and it stops. Rerunning the upgrade script should never
cause an error if you've coded it correctly.
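
Such a guard might look like this (a sketch using the tk/field1 names from
this thread):

```sql
-- Idempotent: re-running the upgrade script skips the ALTER if the
-- column is already there, so a second pass causes no error.
do $$
begin
  if not exists (
    select 1 from information_schema.columns
    where table_name = 'tk' and column_name = 'field1'
  ) then
    alter table tk add column field1 text default 'MUCH';
  end if;
end $$;
```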

The main problem is that sometimes we have to modify tables which have too many records in some customer databases.

I'm going to go with everyone else here because it works. We tend to do updates
in blocks of 10K records at a time. Do some tests and figure out what works best
for your setup (maybe you have enough memory to do 100K chunks). Whatever you
do, make sure that the column you use to divide the work has an index on it! Use
the primary key if it's an INT. If you have no index, find a column you can create
an index on for this work then drop it at the end; that will be far faster than having
to do a bunch of table scans.
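
For instance, batching on an indexed integer primary key avoids rescanning for
NULLs (a sketch; id and the range bounds are illustrative, and each statement
should run in its own transaction):

```sql
-- Batch 1: each run touches only one id range, so locks are short-lived.
update tk set field1 = 'MUCH'
where id >= 1 and id < 10001
  and field1 is null;
-- Next run: id >= 10001 and id < 20001, and so on up to max(id).
```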

Yes, it's probably a very good idea to upgrade to a newer version if you can
as performance improvements come with each new version.

HTH,
Kevin