Potential optimisation for the creation of a partial index condition over a newly created nullable, non-default-valued column?

Started by Tim Dawbornalmost 6 years ago3 messagesgeneral

tim.dawborn@gmail.com

almost 6 years ago

Hi all,

I was playing around with ways to make a schema change recently to a ~30M
record table. I wanted to add a new nullable, non-default-valued column to
this existing table, and then add a new partial to that table, where the
partial index condition refers to a value in that newly added column. I was
expecting that there might be an optimisation here that PostgreSQL could
make, given the partial index condition could not be hit, but it seems not.

Here's what I was playing with:

tmp=> \timing on
Timing is on.
tmp=> BEGIN;
BEGIN
Time: 1.333 ms
tmp=> ALTER TABLE foo ADD COLUMN d integer NULL;
ALTER TABLE
Time: 1.581 ms
tmp=> CREATE UNIQUE INDEX myindex ON foo (a, b, c) where d = 2;
CREATE INDEX
Time: 37758.880 ms (00:37.759)
tmp=> COMMIT;
COMMIT
Time: 3.922 ms

Given that d = 2 could not ever be true as the nullable, non-default-valued
column was just added inside the same transaction, I was hoping that the
index creation would be instantaneous, as it realised there's no rows that
this condition could be true for.

I definitely don't claim to be a databases expert. Is there something I'm
missing as to why this optimisation could not be put in place? If this
seems like a reasonable optimisation that could be made, is there a place
that I should post / record it for future reference / assessment by the
core developers?

Cheers,
Tim

dgrowleyml@gmail.com

almost 6 years ago

In reply to: Tim Dawborn (#1)

Re: Potential optimisation for the creation of a partial index condition over a newly created nullable, non-default-valued column?

On Thu, 4 Jun 2020 at 17:59, Tim Dawborn <tim.dawborn@gmail.com> wrote:

tmp=> \timing on
Timing is on.
tmp=> BEGIN;
BEGIN
Time: 1.333 ms
tmp=> ALTER TABLE foo ADD COLUMN d integer NULL;
ALTER TABLE
Time: 1.581 ms
tmp=> CREATE UNIQUE INDEX myindex ON foo (a, b, c) where d = 2;
CREATE INDEX
Time: 37758.880 ms (00:37.759)
tmp=> COMMIT;
COMMIT
Time: 3.922 ms

Given that d = 2 could not ever be true as the nullable, non-default-valued column was just added inside the same transaction, I was hoping that the index creation would be instantaneous, as it realised there's no rows that this condition could be true for.

While it could be technically possible to do something like check the
xmin of the pg_attribute record for all columns mentioned in the
index's predicate all are set to the current transaction ID and the
index predicate refutes an expression containing those columns with IS
NULL clauses or whatever the DEFAULT expression value is, we've just
no way to know if any rows were inserted or updated between the ALTER
TABLE and the CREATE INDEX. Certainly, no other transaction could have
done anything since we added the column due to us holding the
AccessExclusiveLock. We just don't really have a way to know if our
own transaction did anything in between. So to do this, we'd need to
invent something to track that, and that something wouldn't be free to
maintain and we'd likely need to maintain it all the time since we'd
be unable to predict what future commands might need to use that
information.

It would likely be easier to go about it by having some sort of ALTER
TABLE ADD INDEX command, then just chain the alter tables together. We
have various other optimisations when multiple subcommands are used in
a single ALTER TABLE. However, I'm not suggesting we go and allow
indexes to be created in ALTER TABLE. I"m just suggesting that it
would likely be a better alternative than inventing something to track
when a table last had

David

tim.dawborn@gmail.com

almost 6 years ago

In reply to: David Rowley (#2)

Re: Potential optimisation for the creation of a partial index condition over a newly created nullable, non-default-valued column?

On Fri, 5 Jun 2020 at 20:15, David Rowley <dgrowleyml@gmail.com> wrote:

While it could be technically possible to do something like check the
xmin of the pg_attribute record for all columns mentioned in the
index's predicate all are set to the current transaction ID and the
index predicate refutes an expression containing those columns with IS
NULL clauses or whatever the DEFAULT expression value is, we've just
no way to know if any rows were inserted or updated between the ALTER
TABLE and the CREATE INDEX. Certainly, no other transaction could have
done anything since we added the column due to us holding the
AccessExclusiveLock. We just don't really have a way to know if our
own transaction did anything in between. So to do this, we'd need to
invent something to track that, and that something wouldn't be free to
maintain and we'd likely need to maintain it all the time since we'd
be unable to predict what future commands might need to use that
information.

I see. Thanks for explaining. I'm not familiar with the internals of
Postgres. I thought there might have already been something akin to a
"dirty" bit for each table for each transaction to indicate whether or not
it had been modified in some way, which could have been used for this
hypothetical process. It sounds like that's not the case, in which case
yes, the additional overhead of this bookkeeping required for this feature
would be unjustifiable.

It would likely be easier to go about it by having some sort of ALTER
TABLE ADD INDEX command, then just chain the alter tables together. We
have various other optimisations when multiple subcommands are used in
a single ALTER TABLE. However, I'm not suggesting we go and allow
indexes to be created in ALTER TABLE. I"m just suggesting that it
would likely be a better alternative than inventing something to track
when a table last had

Nice. I hadn't considered that as a way to go about solving this problem.
Having a quick search online for ALTER TABLE CREATE INDEX syntax, it
appears that MySQL does support this notion (via ALTER TABLE ADD INDEX
<https://dev.mysql.com/doc/refman/8.0/en/alter-table.html>), though it
doesn't (currently) support partial indexes. Having a quick look in the
mailing list archives, I couldn't find any discussions about supporting
this syntax. What are the arguments against adding such additional syntax
(other than the standard issues that come with adding additional syntax)?

Cheers,
Tim