[PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi hackers,
It is often not feasible to use `REPLICA IDENTITY FULL` on the
publication, because it leads to a full table scan per tuple change on
the subscription. This makes `REPLICA IDENTITY FULL` impractical for
all but a small number of use cases.
With this patch, I'm proposing the following change: if there is an
index on the subscriber, use that index as long as the planner
sub-modules pick an index over a sequential scan.
Most of the logic on the subscriber side already exists in the code.
The subscriber is already capable of doing (unique) index scans. With
this patch, we allow the index scan to iterate over the fetched tuples
and only act when the tuples are equal. Those familiar with this part
of the code will recognize that the sequential scan code on the
subscriber already relies on the `tuples_equal()` function. In short,
the changes on the subscriber mostly combine parts of the (unique)
index scan and sequential scan code.
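As an illustration of that combination, here is a toy model in Python
(not the patch's C code; `index_lookup` and `tuples_equal` are
stand-ins for the subscriber-side routines):

```python
def find_matching_tuple(index_lookup, tuples_equal, search_tuple):
    """Toy model of the subscriber-side lookup: a (possibly
    non-unique) index narrows down the candidate tuples, and a
    full-row comparison -- the tuples_equal() step that the
    sequential scan path already performs -- confirms the match."""
    for candidate in index_lookup(search_tuple):
        if tuples_equal(candidate, search_tuple):
            return candidate
    return None  # no matching tuple found
```

For example, with an index on a low-cardinality column, the lookup may
return several candidate tuples, and only the full-row comparison
picks the right one.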
The decision on whether to use an index (and which one) is mostly
derived from planner infrastructure. The idea is that on the
subscriber we have all the columns. So, we construct `Path`s with
restrictions on all columns, such as `col_1 = $1 AND col_2 = $2 ...
AND col_n = $N`, and let the planner sub-module --
`create_index_paths()` -- give us the relevant index `Path`s. On top
of that, we add the sequential scan `Path` as well, and finally pick
the cheapest `Path` among them.
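The final selection step can be sketched as follows (a Python
illustration with made-up cost numbers, not the planner's actual
costing code):

```python
def pick_cheapest_path(paths):
    """Mimics the planner's set_cheapest(): among the candidate
    paths (the index paths plus the sequential scan path), the one
    with the lowest total cost wins."""
    return min(paths, key=lambda p: p["total_cost"])

# hypothetical costs for a table where the index lookup is cheap
candidate_paths = [
    {"type": "SeqScan", "total_cost": 1693.0},
    {"type": "IndexScan", "index": "i2", "total_cost": 8.4},
]
```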
From the performance point of view, there are a few things to note.
First, the patch aims not to change the behavior when a PRIMARY KEY or
UNIQUE INDEX is used. Second, when REPLICA IDENTITY is FULL on the
publisher and an index is used on the subscriber, the difference
mostly comes down to `index scan` vs `sequential scan`. That's why it
is hard to claim a specific amount of improvement; it mostly depends
on the data size, the index, and the data distribution.
Still, below I try to showcase the potential improvements using an
index on the subscriber, `pgbench_accounts(bid)`. With the index,
replication catches up in around ~5 seconds. When the index is
dropped, replication takes around ~300 seconds.
// init source db
pgbench -i -s 100 -p 5432 postgres
psql -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;" -p 5432 postgres
psql -c "CREATE INDEX i1 ON pgbench_accounts(aid);" -p 5432 postgres
psql -c "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL;" -p 5432 postgres
psql -c "CREATE PUBLICATION pub_test_1 FOR TABLE pgbench_accounts;" -p 5432 postgres

// init target db, drop existing primary key
pgbench -i -p 9700 postgres
psql -c "truncate pgbench_accounts;" -p 9700 postgres
psql -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;" -p 9700 postgres
psql -c "CREATE SUBSCRIPTION sub_test_1 CONNECTION 'host=localhost port=5432 user=onderkalaci dbname=postgres' PUBLICATION pub_test_1;" -p 9700 postgres

// create one index, even on a low cardinality column
psql -c "CREATE INDEX i2 ON pgbench_accounts(bid);" -p 9700 postgres

// now, run some pgbench tests and observe replication
pgbench -t 500 -b tpcb-like -p 5432 postgres
What do hackers think about this change?
Thanks,
Onder Kalaci
Developing the Citus extension for PostgreSQL
Attachments:
0001_use_index_on_subs_when_pub_rep_ident_full.patch
On Tue, Jul 12, 2022 at 7:07 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
Hi hackers,
It is often not feasible to use `REPLICA IDENTITY FULL` on the publication, because it leads to full table scan
per tuple change on the subscription. This makes `REPLICA IDENTITY FULL` impracticable -- probably other
than some small number of use cases.
IIUC, this proposal is to optimize cases where users can't have a
unique/primary key for a relation on the subscriber and those
relations receive lots of updates or deletes?
With this patch, I'm proposing the following change: If there is an index on the subscriber, use the index
as long as the planner sub-modules picks any index over sequential scan.
Majority of the logic on the subscriber side has already existed in the code. The subscriber is already
capable of doing (unique) index scans. With this patch, we are allowing the index to iterate over the
tuples fetched and only act when tuples are equal. The ones familiar with this part of the code could
realize that the sequential scan code on the subscriber already implements the `tuples_equal()` function.
In short, the changes on the subscriber are mostly combining parts of (unique) index scan and
sequential scan codes.
The decision on whether to use an index (or which index) is mostly derived from planner infrastructure.
The idea is that on the subscriber we have all the columns. So, construct all the `Path`s with the
restrictions on all columns, such as `col_1 = $1 AND col_2 = $2 ... AND col_n = $N`. Finally, let
the planner sub-module -- `create_index_paths()` -- to give us the relevant index `Path`s. On top of
that adds the sequential scan `Path` as well. Finally, pick the cheapest `Path` among.
From the performance point of view, there are few things to note. First, the patch aims not to
change the behavior when PRIMARY KEY or UNIQUE INDEX is used. Second, when REPLICA IDENTITY
IS FULL on the publisher and an index is used on the subscriber, the difference mostly comes down
to `index scan` vs `sequential scan`. That's why it is hard to claim a certain number of improvements.
It mostly depends on the data size, index and the data distribution.
It seems that in favorable cases it will improve performance, but we
should consider unfavorable cases as well. Two things that come to
mind in that regard are: (a) while choosing index/seq. scan paths, the
patch doesn't account for the cost of tuples_equal(), which needs to
be performed for index scans; (b) it appears to me that the patch
decides which index to use the first time it opens the rel (or if the
rel gets invalidated) on the subscriber, and then uses the same index
for all consecutive operations. It is quite possible that after some
more operations on the table, using the same index will actually be
costlier than a sequential scan or some other index scan.
--
With Regards,
Amit Kapila.
On Mon, Jul 18, 2022 at 11:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jul 12, 2022 at 7:07 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
Hi hackers,
It is often not feasible to use `REPLICA IDENTITY FULL` on the publication, because it leads to full table scan
per tuple change on the subscription. This makes `REPLICA IDENTITY FULL` impracticable -- probably other
than some small number of use cases.
IIUC, this proposal is to optimize cases where users can't have a
unique/primary key for a relation on the subscriber and those
relations receive lots of updates or deletes?

With this patch, I'm proposing the following change: If there is an index on the subscriber, use the index
as long as the planner sub-modules picks any index over sequential scan.
Majority of the logic on the subscriber side has already existed in the code. The subscriber is already
capable of doing (unique) index scans. With this patch, we are allowing the index to iterate over the
tuples fetched and only act when tuples are equal. The ones familiar with this part of the code could
realize that the sequential scan code on the subscriber already implements the `tuples_equal()` function.
In short, the changes on the subscriber are mostly combining parts of (unique) index scan and
sequential scan codes.
The decision on whether to use an index (or which index) is mostly derived from planner infrastructure.
The idea is that on the subscriber we have all the columns. So, construct all the `Path`s with the
restrictions on all columns, such as `col_1 = $1 AND col_2 = $2 ... AND col_n = $N`. Finally, let
the planner sub-module -- `create_index_paths()` -- to give us the relevant index `Path`s. On top of
that adds the sequential scan `Path` as well. Finally, pick the cheapest `Path` among.
From the performance point of view, there are few things to note. First, the patch aims not to
change the behavior when PRIMARY KEY or UNIQUE INDEX is used. Second, when REPLICA IDENTITY
IS FULL on the publisher and an index is used on the subscriber, the difference mostly comes down
to `index scan` vs `sequential scan`. That's why it is hard to claim a certain number of improvements.
It mostly depends on the data size, index and the data distribution.

It seems that in favorable cases it will improve performance but we
should consider unfavorable cases as well. Two things that come to
mind in that regard are (a) while choosing index/seq. scan paths, the
patch doesn't account for cost for tuples_equal() which needs to be
performed for index scans, (b) it appears to me that the patch decides
which index to use the first time it opens the rel (or if the rel gets
invalidated) on subscriber and then for all consecutive operations it
uses the same index. It is quite possible that after some more
operations on the table, using the same index will actually be
costlier than a sequence scan or some other index scan.
Point (a) won't matter because we perform tuples_equal() for both
sequential and index scans. So, we can ignore point (a).
--
With Regards,
Amit Kapila.
Hi, thanks for your reply.
Amit Kapila <amit.kapila16@gmail.com> wrote on Mon, 18 Jul 2022 at 08:29:
On Tue, Jul 12, 2022 at 7:07 PM Önder Kalacı <onderkalaci@gmail.com> wrote:

Hi hackers,
It is often not feasible to use `REPLICA IDENTITY FULL` on the
publication, because it leads to full table scan
per tuple change on the subscription. This makes `REPLICA IDENTITY FULL`
impracticable -- probably other
than some small number of use cases.
IIUC, this proposal is to optimize cases where users can't have a
unique/primary key for a relation on the subscriber and those
relations receive lots of updates or deletes?
Yes, that is right.
From a similar perspective, I see this patch as useful for reducing
the "use a primary key/unique index" requirement to "use any index"
for reasonably performant logical replication with updates/deletes.
It seems that in favorable cases it will improve performance but we
should consider unfavorable cases as well. Two things that come to
mind in that regard are (a) while choosing index/seq. scan paths, the
patch doesn't account for cost for tuples_equal() which needs to be
performed for index scans, (b) it appears to me that the patch decides
which index to use the first time it opens the rel (or if the rel gets
invalidated) on subscriber and then for all consecutive operations it
uses the same index. It is quite possible that after some more
operations on the table, using the same index will actually be
costlier than a sequence scan or some other index scan
Regarding (b), yes, that is a concern I share, and I was actually
considering sending another patch regarding this.
Currently, I can see two options, and I'm happy to hear your take on
these (or maybe another idea?):
- Add a new class of invalidation callbacks: Today, if we do ALTER
TABLE or CREATE INDEX on a table, CacheRegisterRelcacheCallback helps
us re-create the cache entries. In this case, as far as I can see, we
need a callback that is called when the table is ANALYZEd, because
that is when the statistics change. That is the time when picking a
new index makes sense. However, that seems like adding another
dimension to this patch, which I can try, but I also see that
committing becomes even harder. So, please see the next idea as well.
- Ask users to manually pick the index they want to use: Currently,
the main complexity of the patch comes from the planner-related code.
In fact, if you look at the logical replication related changes, those
are relatively modest. If we drop the feature where Postgres picks the
index, and instead provide a user interface to set the index per table
in the subscription, we can probably have an easier patch to review &
test. For example, we could add an `ALTER SUBSCRIPTION sub ALTER TABLE
t USE INDEX i` type of API. This also needs some coding, but it is
probably much simpler than the current code. And, obviously, this
raises the question of whether users can pick the right index.
Probably not always, but at least that seems like a good start for
using this performance improvement.
Thoughts?
Thanks,
Onder Kalaci
On Tue, Jul 19, 2022 at 1:46 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
Hi, thanks for your reply.
Amit Kapila <amit.kapila16@gmail.com> wrote on Mon, 18 Jul 2022 at 08:29:
IIUC, this proposal is to optimize cases where users can't have a
unique/primary key for a relation on the subscriber and those
relations receive lots of updates or deletes?

Yes, that is right.
In a similar perspective, I see this patch useful for reducing the "use primary key/unique index" requirement to "use any index" for a reasonably performant logical replication with updates/deletes.
Agreed. BTW, have you seen any such requirements from users where this
will be useful for them?
It seems that in favorable cases it will improve performance but we
should consider unfavorable cases as well. Two things that come to
mind in that regard are (a) while choosing index/seq. scan paths, the
patch doesn't account for cost for tuples_equal() which needs to be
performed for index scans, (b) it appears to me that the patch decides
which index to use the first time it opens the rel (or if the rel gets
invalidated) on subscriber and then for all consecutive operations it
uses the same index. It is quite possible that after some more
operations on the table, using the same index will actually be
costlier than a sequence scan or some other index scan

Regarding (b), yes that is a concern I share. And, I was actually considering sending another patch regarding this.
Currently, I can see two options and happy to hear your take on these (or maybe another idea?)
- Add a new class of invalidation callbacks: Today, if we do ALTER TABLE or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to re-create the cache entries. In this case, as far as I can see, we need a callback that is called when table "ANALYZE"d, because that is when the statistics change. That is the time picking a new index makes sense.
However, that seems like adding another dimension to this patch, which I can try but also see that committing becomes even harder.
This idea sounds worth investigating. I see that this will require
more work but OTOH, we can't allow the existing system to regress
especially because depending on workload it might regress badly. We
can create a patch for this atop the base patch for easier review/test
but I feel we need some way to address this point.
So, please see the next idea as well.
- Ask users to manually pick the index they want to use: Currently, the main complexity of the patch comes with the planner related code. In fact, if you look into the logical replication related changes, those are relatively modest changes. If we can drop the feature that Postgres picks the index, and provide a user interface to set the indexes per table in the subscription, we can probably have an easier patch to review & test. For example, we could add `ALTER SUBSCRIPTION sub ALTER TABLE t USE INDEX i` type of an API. This also needs some coding, but probably much simpler than the current code. And, obviously, this pops up the question of can users pick the right index?
I think picking the right index is one point; another is what if the
subscription has many tables (say 10K or more) -- doing it for
individual tables per subscription won't be fun. Also, users need to
identify which tables belong to a particular subscription; users can
find that via pg_subscription_rel or some other way, but doing this
won't be straightforward for them. So, my inclination would be to pick
the right index automatically rather than getting the input from the
user.
Now, your point related to planner code in the patch bothers me as
well but I haven't studied the patch in detail to provide any
alternatives at this stage. Do you have any other ideas to make it
simpler or solve this problem in some other way?
--
With Regards,
Amit Kapila.
On Mon, Jul 18, 2022 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
IIUC, this proposal is to optimize cases where users can't have a
unique/primary key for a relation on the subscriber and those
relations receive lots of updates or deletes?
I think this patch optimizes for all non-trivial cases of update/delete
replication (e.g. >1000 rows in the table, >1000 rows per hour updated)
without a primary key. For instance, it's quite common to have a large
append-mostly events table without a primary key (e.g. because of
partitioning, or insertion speed), which will still have occasional batch
updates/deletes.
Imagine an update of a table or partition with 1 million rows and a typical
scan speed of 1M rows/sec. An update on the whole table takes maybe 1-2
seconds. Replicating the update using a sequential scan per row can take on
the order of ~12 days ≈ 1M seconds.
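The back-of-the-envelope arithmetic behind that estimate, using the
round numbers above:

```python
rows = 1_000_000            # rows in the table (or partition)
scan_rate = 1_000_000       # sequential scan speed, rows per second

# With REPLICA IDENTITY FULL and no usable index, each replicated
# row change triggers a full sequential scan on the subscriber.
seconds_per_change = rows / scan_rate      # ~1 second per change
total_seconds = rows * seconds_per_change  # 1,000,000 seconds
days = total_seconds / 86_400              # roughly 11.6 days
```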
The current implementation makes using REPLICA IDENTITY FULL a huge
liability, and impractical for scenarios where you want to replicate
an arbitrary set of user-defined tables, such as upgrades, migrations,
and shard moves. We generally recommend that users tolerate
update/delete errors in such scenarios.
If the apply worker can use an index, the data migration tool can
tactically create one on a high cardinality column, which would practically
always be better than doing a sequential scan for non-trivial workloads.
cheers,
Marco
Hi,
- Add a new class of invalidation callbacks: Today, if we do ALTER TABLE
or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to
re-create the cache entries. In this case, as far as I can see, we need a
callback that is called when table "ANALYZE"d, because that is when the
statistics change. That is the time picking a new index makes sense.
However, that seems like adding another dimension to this patch, which I
can try but also see that committing becomes even harder.
This idea sounds worth investigating. I see that this will require
more work but OTOH, we can't allow the existing system to regress
especially because depending on workload it might regress badly.
Just to note, in case it is not clear: this patch avoids (or at least
aims to avoid, assuming no bugs) changing the behavior of existing
systems with a PRIMARY KEY or UNIQUE index. In those cases, we still
use the relevant indexes.
We
can create a patch for this atop the base patch for easier review/test
but I feel we need some way to address this point.
Another idea could be to re-calculate the index choice, say after
every N updates/deletes on the table. We may consider using a
subscription parameter for getting N -- with a good default -- or even
hard-coding it. I think the cost of re-calculating should really be
pretty small compared to the other things happening during logical
replication. So, a sane default might work?
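A minimal sketch of that idea (plain Python; the class name, the
counter placement, and the default of N = 10,000 are all made up for
illustration):

```python
class CachedIndexChoice:
    """Re-run the (expensive, planner-based) index selection after
    every N changes applied to the table, instead of only when the
    relcache entry is invalidated."""

    def __init__(self, choose_index, n=10_000):
        self.choose_index = choose_index  # the planner-based picker
        self.n = n                        # re-plan interval
        self.changes_since_pick = 0
        self.index = choose_index()

    def current_index(self):
        return self.index

    def note_change(self):
        """Called once per applied update/delete."""
        self.changes_since_pick += 1
        if self.changes_since_pick >= self.n:
            self.changes_since_pick = 0
            self.index = self.choose_index()
```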
If you think the above doesn't work, I can try to work on a separate patch
which adds something like "analyze invalidation callback".
- Ask users to manually pick the index they want to use: Currently, the
main complexity of the patch comes with the planner related code. In fact,
if you look into the logical replication related changes, those are
relatively modest changes. If we can drop the feature that Postgres picks
the index, and provide a user interface to set the indexes per table in the
subscription, we can probably have an easier patch to review & test. For
example, we could add `ALTER SUBSCRIPTION sub ALTER TABLE t USE INDEX i`
type of an API. This also needs some coding, but probably much simpler than
the current code. And, obviously, this pops up the question of can users
pick the right index?

I think picking the right index is one point and another is what if
the subscription has many tables (say 10K or more), doing it for
individual tables per subscription won't be fun. Also, users need to
identify which tables belong to a particular subscription, now, users
can find the same via pg_subscription_rel or some other way but doing
this won't be straightforward for users. So, my inclination would be
to pick the right index automatically rather than getting the input
from the user.
Yes, all makes sense.
Now, your point related to planner code in the patch bothers me as
well but I haven't studied the patch in detail to provide any
alternatives at this stage. Do you have any other ideas to make it
simpler or solve this problem in some other way?
One idea I tried earlier was to go over the existing indexes on the
table, get the IndexInfo via BuildIndexInfo(), and then try to find a
good heuristic to pick an index. In the end, I felt that was doing a
sub-optimal job, requiring a similar amount of code as the current
patch, while still using similar infrastructure.
My conclusion for that route was that I should either use a very
simple heuristic (like picking the index with the most columns) and
accept a suboptimal index pick, OR use a complex heuristic with a
reasonable index pick. And the latter approach converged to the
planner code in the patch. Do you think the former approach is
acceptable?
Thanks,
Onder
On Wed, Jul 20, 2022 at 8:19 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
- Add a new class of invalidation callbacks: Today, if we do ALTER TABLE or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to re-create the cache entries. In this case, as far as I can see, we need a callback that is called when table "ANALYZE"d, because that is when the statistics change. That is the time picking a new index makes sense.
However, that seems like adding another dimension to this patch, which I can try but also see that committing becomes even harder.

This idea sounds worth investigating. I see that this will require
more work but OTOH, we can't allow the existing system to regress
especially because depending on workload it might regress badly.

Just to note if that is not clear: This patch avoids (or at least aims to avoid assuming no bugs) changing the behavior of the existing systems with PRIMARY KEY or UNIQUE index. In that case, we still use the relevant indexes.
We
can create a patch for this atop the base patch for easier review/test
but I feel we need some way to address this point.

One another idea could be to re-calculate the index, say after N updates/deletes for the table. We may consider using subscription_parameter for getting N -- with a good default, or even hard-code into the code. I think the cost of re-calculating should really be pretty small compared to the other things happening during logical replication. So, a sane default might work?
One difficulty in deciding the value of N for the user, or choosing a
default, would be that we probably need to consider the local DML
operations on the table as well.
If you think the above doesn't work, I can try to work on a separate patch which adds something like "analyze invalidation callback".
I suggest we give this a try, and if it turns out to be problematic or
complex, then we can think of using some heuristic as you suggested
above.
Now, your point related to planner code in the patch bothers me as
well but I haven't studied the patch in detail to provide any
alternatives at this stage. Do you have any other ideas to make it
simpler or solve this problem in some other way?

One idea I tried earlier was to go over the existing indexes and on the table, then get the IndexInfo via BuildIndexInfo(). And then, try to find a good heuristic to pick an index. In the end, I felt like that is doing a sub-optimal job, requiring a similar amount of code of the current patch, and still using the similar infrastructure.
My conclusion for that route was I should either use a very simple heuristic (like pick the index with the most columns) and have a suboptimal index pick,
Not only that, but say all the indexes have the same number of
columns; then we probably need to either pick the first such index or
use some other heuristic.
OR use a complex heuristic with a reasonable index pick. And, the latter approach converged to the planner code in the patch. Do you think the former approach is acceptable?
In this regard, I was thinking about the cases in which a sequential
scan can be better than an index scan (assuming one is available). I
think if a certain column has a lot of duplicates (for example, a
boolean column), then doing a sequential scan is probably better.
Considering this, even though your other approach sounds simpler, it
could lead to unpredictable results. So, I think the latter approach
is preferable.
BTW, do we want to consider partial indexes for the scan in this
context? I mean, a partial index may not have data for all rows, so
how would it be usable?
Few comments:
===============
1.
static List *
+CreateReplicaIdentityFullPaths(Relation localrel)
{
...
+ /*
+ * Rather than doing all the pushups that would be needed to use
+ * set_baserel_size_estimates, just do a quick hack for rows and width.
+ */
+ rel->rows = rel->tuples;
Is it a good idea to set rows without any selectivity estimation?
Won't this always use the entire row count of the relation? Also, if
we don't want to use set_baserel_size_estimates(), how will we compute
baserestrictcost, which is later used in the costing of paths (for
example, the costing of the seqscan path (cost_seqscan) uses it)?
In general, I think it will be better to consider calling some
top-level planner functions even for paths. Can we consider using
make_one_rel() instead of building individual paths? On similar lines,
in the function PickCheapestIndexPathIfExists(), can we use
set_cheapest()?
2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel,
Relation idxrel,
int2vector *indkey = &idxrel->rd_index->indkey;
bool hasnulls = false;
- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
- RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));
You have removed this assertion, but there is a comment ("This is not
generic routine, it expects the idxrel to be replication identity of a
rel and meet all limitations associated with that.") atop this
function which needs to be changed/removed, and we should probably
think about whether the function needs some change after removing that
restriction.
--
With Regards,
Amit Kapila.
Hi,
One another idea could be to re-calculate the index, say after N
updates/deletes for the table. We may consider using subscription_parameter
for getting N -- with a good default, or even hard-code into the code. I
think the cost of re-calculating should really be pretty small compared to
the other things happening during logical replication. So, a sane default
might work?

One difficulty in deciding the value of N for the user or choosing a
default would be that we need to probably consider the local DML
operations on the table as well.
Fair enough, it is not easy to find a good default.
If you think the above doesn't work, I can try to work on a separate
patch which adds something like "analyze invalidation callback".
I suggest we should give this a try and if this turns out to be
problematic or complex then we can think of using some heuristic as
you are suggesting above.
Alright, I'll try this and respond back shortly.
OR use a complex heuristic with a reasonable index pick. And, the latter
approach converged to the planner code in the patch. Do you think the
former approach is acceptable?

In this regard, I was thinking in which cases a sequence scan can be
better than the index scan (considering one is available). I think if
a certain column has a lot of duplicates (for example, a column has a
boolean value) then probably doing a sequence scan is better. Now,
considering this even though your other approach sounds simpler but
could lead to unpredictable results. So, I think the latter approach
is preferable.
Yes, that makes sense. I also considered this during the development
of the patch, but forgot to mention it :)
BTW, do we want to consider partial indexes for the scan in this
context? I mean it may not have data of all rows so how that would be
usable?
As far as I can see, check_index_predicates() never picks a partial
index for the baserestrictinfos we create in
CreateReplicaIdentityFullPaths(). The reason is that we have roughly
the following call stack:
- check_index_predicates
-- predicate_implied_by
--- predicate_implied_by_recurse
---- predicate_implied_by_simple_clause
----- operator_predicate_proof
And, inside operator_predicate_proof(), there is never going to be a
provable equality, because we push `Param`s into the baserestrictinfos
whereas the index predicates always compare against `Const`s.
If we want to make it even more explicit, I can filter out `Path`s
with partial indexes, but that seems redundant to me. For now, I
pushed the commit with an assertion that we never pick partial indexes
and also added a test.
If you think it is better to explicitly filter out partial indexes, I
can do that as well.
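To make the `Param` vs `Const` point concrete, here is a toy model in
Python (not the actual operator_predicate_proof() code) of why a
restriction comparing against a runtime `Param` can never prove a
partial index's predicate:

```python
PARAM = object()  # a value unknown at planning time, like a Param node

def predicate_implied_by(clause_value, index_predicate):
    """Toy version of the proof step: an implication can only be
    established when the clause compares against a known constant.
    With a Param there is nothing to evaluate, so a partial index
    is never considered applicable."""
    if clause_value is PARAM:
        return False  # cannot prove anything about an unknown value
    return index_predicate(clause_value)

# hypothetical partial index predicate: WHERE active = true
active_predicate = lambda v: v is True
```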
Few comments:
===============
1.
static List *
+CreateReplicaIdentityFullPaths(Relation localrel)
{
...
+ /*
+ * Rather than doing all the pushups that would be needed to use
+ * set_baserel_size_estimates, just do a quick hack for rows and width.
+ */
+ rel->rows = rel->tuples;

Is it a good idea to set rows without any selectivity estimation?
Won't this always set the entire rows in a relation? Also, if we don't
want to use set_baserel_size_estimates(), how will we compute
baserestrictcost which will later be used in the costing of paths (for
example, costing of seqscan path (cost_seqscan) uses it)?

In general, I think it will be better to consider calling some
top-level planner functions even for paths. Can we consider using
make_one_rel() instead of building individual paths?
Thanks, this looks like a good suggestion/simplification. I wanted to
use the least amount of code possible, and make_one_rel() does either
exactly what I need or slightly more, which is great.
Note that make_one_rel() also follows the same call stack that I noted
above, so I cannot spot any problems with partial indexes. Or am I
missing something here?
On similar lines,
in function PickCheapestIndexPathIfExists(), can we use
set_cheapest()?
Yes, make_one_rel() + set_cheapest() sounds better. Changed.
2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel,
Relation idxrel,
int2vector *indkey = &idxrel->rd_index->indkey;
bool hasnulls = false;

- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
- RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

You have removed this assertion but there is a comment ("This is not
generic routine, it expects the idxrel to be replication identity of a
rel and meet all limitations associated with that.") atop this
function which either needs to be changed/removed and probably we
should think if the function needs some change after removing that
restriction.
Ack, I can see your point. I think, for example, we should skip index
attributes that are not simple column references. And probably
whatever other restrictions a PRIMARY KEY has should apply here as
well.
I'll read some more Postgres code & test before pushing a revision for
this part. In the meantime, if you have any suggestions/pointers for
me to look into, please note them here.
Attached is v2 of the patch, addressing some of your comments. I'll
work on the remaining ones shortly.
Thanks,
Onder
Attachments:
v2_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Hi,
2.
@@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel,
Relation idxrel,
int2vector *indkey = &idxrel->rd_index->indkey;
bool hasnulls = false;

- Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
- RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));

You have removed this assertion but there is a comment ("This is not
generic routine, it expects the idxrel to be replication identity of a
rel and meet all limitations associated with that.") atop this
function which either needs to be changed/removed and probably we
should think if the function needs some change after removing that
restriction.Ack, I can see your point. I think, for example, we should skip index
attributes that are not simple column references. And, probably whatever
other restrictions that PRIMARY has, should be here.
Primary keys require:
- Unique: We don't need uniqueness; that's the point of this patch.
- Valid index: Should not be an issue in this case, because the planner
would not pick an invalid index anyway.
- Non-partial index: As discussed earlier in this thread, I really don't
see any problems with partial indexes for this use-case. Please let me know
if there is anything I miss.
- Deferrable/Immediate: As far as I can see, there are no such concepts
for regular indexes, so this does not apply here.
- Indexes with no expressions: This is the point where we require some
minor changes inside/around `build_replindex_scan_key`. Previously,
indexes on expressions could not be replica indexes; with this patch
they can. However, the expressions cannot be used for filtering the tuples
because of the way we create the restrictinfos. We essentially create
`WHERE col_1 = $1 AND col_2 = $2 ... AND col_n = $n` for the columns with
equality operators available. In the case of indexes on expressions,
the planner would never pick such indexes with these restrictions. I
changed `build_replindex_scan_key` to reflect that, added a new assert, and
pushed tests with the following schema, making sure the code behaves as
expected:
CREATE TABLE people (firstname text, lastname text);
CREATE INDEX people_names_expr_only ON people ((firstname || ' ' ||
lastname));
CREATE INDEX people_names_expr_and_columns ON people ((firstname || ' ' ||
lastname), firstname, lastname);
Also did similar tests with indexes on jsonb fields. Does that address
your concerns regarding indexes with expressions?
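To make the distinction concrete: in pg_index, an indkey entry of 0 marks an expression column rather than a simple column reference, so an index like people_names_expr_only above has indkey = {0} while people_names_expr_and_columns has indkey = {0, 1, 2}. Below is a rough, self-contained C sketch of the expression-only check; the helper name is hypothetical and the real check works on the relcache index information, not a bare array:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* An index usable for this optimization must have at least one key that
 * is a simple column reference; an index consisting solely of
 * expressions (every indkey entry 0) is filtered out. */
bool
sketch_index_only_on_expressions(const int *indkey, size_t nkeys)
{
	for (size_t i = 0; i < nkeys; i++)
	{
		if (indkey[i] != 0)		/* a plain column reference */
			return false;
	}
	return true;
}
```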
I'll work on one of the other open items in the thread (e.g., analyze
invalidation callback) separately.
Thanks,
Onder KALACI
Attachments:
v3_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Hi,
As far as I can see, the following is the answer to the only remaining open
discussion in this thread. Let me know if anything is missed.
(b) it appears to me that the patch decides
which index to use the first time it opens the rel (or if the rel gets
invalidated) on subscriber and then for all consecutive operations it
uses the same index. It is quite possible that after some more
operations on the table, using the same index will actually be
costlier than a sequence scan or some other index scan

Regarding (b), yes, that is a concern I share. And, I was actually
considering sending another patch regarding this.
Currently, I can see two options and am happy to hear your take on these
(or maybe another idea?):
- Add a new class of invalidation callbacks: Today, if we do ALTER TABLE
or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to
re-create the cache entries. In this case, as far as I can see, we need a
callback that is called when table "ANALYZE"d, because that is when the
statistics change. That is the time picking a new index makes sense.

However, that seems like adding another dimension to this patch, which I
can try but also see that committing becomes even harder.
This idea sounds worth investigating. I see that this will require
more work but OTOH, we can't allow the existing system to regress
especially because depending on workload it might regress badly. We
can create a patch for this atop the base patch for easier review/test
but I feel we need some way to address this point.
It turns out that we already invalidate the relevant entries
in LogicalRepRelMap/LogicalRepPartMap when "ANALYZE" (or VACUUM) updates
any of the statistics in pg_class.
The call-stack for analyze is roughly:
do_analyze_rel()
-> vac_update_relstats()
-> heap_inplace_update()
-> if needs to apply any statistical change
-> CacheInvalidateHeapTuple()
And, we register for those invalidations already:
logicalrep_relmap_init() / logicalrep_partmap_init()
-> CacheRegisterRelcacheCallback()
Added a test which triggers this behavior. The test is as follows:
- Create two indexes on the target, on column_a and column_b
- Initially load data such that the column_a has a high cardinality
- Show that we use the index on column_a
- Load more data such that the column_b has higher cardinality
- ANALYZE on the target table
- Show that we use the index on column_b afterwards
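The cache-driven re-selection this test exercises can be modeled roughly like this. It is a self-contained C sketch with hypothetical names; in the patch the cached oid lives in the LogicalRepRelMap entries and the invalidation arrives via the registered relcache callback when ANALYZE updates pg_class:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of a LogicalRepRelMap entry. */
typedef struct SketchRelEntry
{
	bool		valid;
	int			chosen_index_oid;
} SketchRelEntry;

/* Stand-in for the planner-based decision: pretend the current
 * statistics favor the given index oid. */
int
sketch_pick_index(int stats_favor_oid)
{
	return stats_favor_oid;
}

/* The chosen index is computed lazily and kept until the entry is
 * invalidated (e.g. by ANALYZE updating pg_class). */
int
sketch_get_index(SketchRelEntry *entry, int stats_favor_oid)
{
	if (!entry->valid)
	{
		entry->chosen_index_oid = sketch_pick_index(stats_favor_oid);
		entry->valid = true;
	}
	return entry->chosen_index_oid;
}

/* Replays the scenario from the test description above: pick once,
 * stay cached even though the stats change, re-pick after invalidation. */
int
sketch_demo(void)
{
	SketchRelEntry entry = {false, 0};
	int			first = sketch_get_index(&entry, 1001);
	int			cached = sketch_get_index(&entry, 2002);

	entry.valid = false;		/* ANALYZE -> relcache invalidation */
	return (first == 1001 && cached == 1001 &&
			sketch_get_index(&entry, 2002) == 2002) ? 1 : 0;
}
```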
Thanks,
Onder KALACI
Attachments:
v4_0001_use_index_on_subs_when_pub_rep_ident_full.patch
On Mon, Aug 1, 2022 at 9:52 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
Hi,
As far as I can see, the following is the answer to the only remaining open discussion in this thread. Let me know if anything is missed.
(b) it appears to me that the patch decides
which index to use the first time it opens the rel (or if the rel gets
invalidated) on subscriber and then for all consecutive operations it
uses the same index. It is quite possible that after some more
operations on the table, using the same index will actually be
costlier than a sequence scan or some other index scan

Regarding (b), yes, that is a concern I share. And, I was actually considering sending another patch regarding this.
Currently, I can see two options and happy to hear your take on these (or maybe another idea?)
- Add a new class of invalidation callbacks: Today, if we do ALTER TABLE or CREATE INDEX on a table, the CacheRegisterRelcacheCallback helps us to re-create the cache entries. In this case, as far as I can see, we need a callback that is called when table "ANALYZE"d, because that is when the statistics change. That is the time picking a new index makes sense.
However, that seems like adding another dimension to this patch, which I can try but also see that committing becomes even harder.

This idea sounds worth investigating. I see that this will require
more work but OTOH, we can't allow the existing system to regress
especially because depending on workload it might regress badly. We
can create a patch for this atop the base patch for easier review/test
but I feel we need some way to address this point.

It turns out that we already invalidate the relevant entries in LogicalRepRelMap/LogicalRepPartMap when "ANALYZE" (or VACUUM) updates any of the statistics in pg_class.
The call-stack for analyze is roughly:
do_analyze_rel()
-> vac_update_relstats()
-> heap_inplace_update()
-> if needs to apply any statistical change
-> CacheInvalidateHeapTuple()
Yeah, it appears that this will work but I see that we don't update
here for inherited stats, how does it work for such cases?
--
With Regards,
Amit Kapila.
On Fri, Jul 22, 2022 at 9:45 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
BTW, do we want to consider partial indexes for the scan in this
context? I mean it may not have data of all rows so how that would be
usable?

As far as I can see, check_index_predicates() never picks a partial index for the baserestrictinfos we create in CreateReplicaIdentityFullPaths(). The reason is that we have roughly the following call stack:
check_index_predicates()
-> predicate_implied_by()
  -> predicate_implied_by_recurse()
    -> predicate_implied_by_simple_clause()
      -> operator_predicate_proof()

And, inside operator_predicate_proof(), there is never going to be an equality, because we push `Param`s to the baserestrictinfos whereas the index predicates are always `Const`.
I agree that the way the baserestrictinfos are currently formed by the
patch, it won't select the partial path, and chances are that that will be
true in the future as well, but I think it is better to be explicit in this
case to avoid creating a dependency between two code paths.
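The Param-vs-Const point can be illustrated with a tiny model: predicate proof needs to compare two known constants, and when the clause side carries a runtime Param its value is unknown at planning time, so no implication can be proven and the partial index is never chosen. A simplified, self-contained C sketch (not the real operator_predicate_proof(), which handles general operator implications):

```c
#include <assert.h>
#include <stdbool.h>

/* One side of a clause: either a known Const or an unknown Param. */
typedef struct SketchArg
{
	bool		is_const;
	int			value;			/* meaningful only when is_const */
} SketchArg;

/* Can "col = arg" be proven to imply the index predicate
 * "col = predicate_const"?  Only when both sides are known constants
 * that match; a Param proves nothing at plan time. */
bool
sketch_clause_implies_predicate(SketchArg arg, int predicate_const)
{
	if (!arg.is_const)
		return false;
	return arg.value == predicate_const;
}
```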
Few other comments:
==================
1. Why is it a good idea to choose the index selected even for the
bitmap path (T_BitmapHeapScan or T_BitmapIndexScan)? We use index scan
during update/delete, so not sure how we can conclude to use index for
bitmap paths.
2. The index info is built even on insert, so for workloads where there
are no updates/deletes, or where those are not published, this index
selection work will go to waste. Will it be better to do it at the first
update/delete? One can say that it is not worth the hassle as anyway
it will be built the first time we perform an operation on the
relation or after the relation gets invalidated. If we think so, then
probably adding a comment could be useful.
3.
+my $synced_query =
+ "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
...
...
+# wait for initial table synchronization to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+ or die "Timed out while waiting for subscriber to synchronize data";
You can avoid such instances in the test by using the new
infrastructure added in commit 0c20dd33db.
4.
LogicalRepRelation *remoterel = &root->remoterel;
+
Oid partOid = RelationGetRelid(partrel);
Spurious line addition.
--
With Regards,
Amit Kapila.
Hi,
Thanks for the feedback, see my reply below.
It turns out that we already invalidate the relevant entries in
LogicalRepRelMap/LogicalRepPartMap when "ANALYZE" (or VACUUM) updates any
of the statistics in pg_class.

The call-stack for analyze is roughly:
do_analyze_rel()
-> vac_update_relstats()
-> heap_inplace_update()
-> if needs to apply any statistical change
-> CacheInvalidateHeapTuple()
Yeah, it appears that this will work but I see that we don't update
here for inherited stats, how does it work for such cases?
There, the expansion of the relation list to partitions happens one level
above on the call stack. So, the call stack looks like the following:
autovacuum_do_vac_analyze() (or ExecVacuum)
-> vacuum()
-> expand_vacuum_rel()
-> rel_list=parent+children partitions
-> for rel in rel_list
->analyze_rel()
->do_analyze_rel
... (and the same call stack as above)
I also added one variation of a similar test for partitioned tables, which
I earlier added for non-partitioned tables as well:
Added a test which triggers this behavior. The test is as follows:
- Create two indexes on the target, on column_a and column_b
- Initially load data such that the column_a has a high cardinality
- Show that we use the index on column_a on a *child* table
- Load more data such that the column_b has higher cardinality
- ANALYZE on the *parent* table
- Show that we use the index on column_b afterwards on the *child* table
My answer for the above assumes that your question is regarding what
happens if you ANALYZE on a partitioned table. If your question is
something different, please let me know.
BTW, do we want to consider partial indexes for the scan in this
context? I mean it may not have data of all rows so how that would be
usable?

As far as I can see, check_index_predicates() never picks a partial
index for the baserestrictinfos we create in
CreateReplicaIdentityFullPaths(). The reason is that we have roughly the
following call stack:

check_index_predicates()
-> predicate_implied_by()
  -> predicate_implied_by_recurse()
    -> predicate_implied_by_simple_clause()
      -> operator_predicate_proof()

And, inside operator_predicate_proof(), there is never going to be an
equality, because we push `Param`s to the baserestrictinfos whereas the
index predicates are always `Const`.

I agree that the way the baserestrictinfos are currently formed by the patch,
it won't select the partial path, and chances are that that will be
true in future as well but I think it is better to be explicit in this
case to avoid creating a dependency between two code paths.
Yes, it makes sense. So, I changed the Assert into a function where we
filter out partial indexes and indexes only on expressions, so that we do
not create such dependencies between the planner and this code.
If one day the planner supports using column values on indexes with
expressions, this code would simply not use the optimization until we make
some improvements in this code path. I think that is a fair trade-off
for now.
Few other comments:
==================
1. Why is it a good idea to choose the index selected even for the
bitmap path (T_BitmapHeapScan or T_BitmapIndexScan)? We use index scan
during update/delete, so not sure how we can conclude to use index for
bitmap paths.
In our case, during update/delete we are searching for a single tuple on
the target. And, it seems like using an index is probably going to be
cheaper for finding the single tuple. In general, I thought we should use
an index if the planner ever decides to use it with the given restrictions.
Also, for the majority of the use-cases, I think we'd probably expect an
index on a column with high cardinality -- hence an index scan. So, bitmap
index scans are probably not going to be that common.
Still, I don't see a problem with using such indexes. Of course, it is
possible that I might be missing something. Do you have any specific
concerns in this area?
2. The index info is built even on insert, so workload, where there
are no updates/deletes or those are not published then this index
selection work will go waste. Will it be better to do it at first
update/delete? One can say that it is not worth the hassle as anyway
it will be built the first time we perform an operation on the
relation or after the relation gets invalidated.
With the current approach, the index (re)-calculation is coupled with
(in)validation of the relevant cache entries. So, I'd argue that for the
simplicity of the code we can afford this small overhead.
According to my local measurements, especially for large tables, the index
oid calculation is mostly insignificant compared to the rest of the steps.
Does that sound OK to you?
If we think so, then
probably adding a comment could be useful.
Yes, that is useful if you are OK with the above, added.
3.
+my $synced_query =
+  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
...
...
+# wait for initial table synchronization to finish
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";

You can avoid such instances in the test by using the new
infrastructure added in commit 0c20dd33db.
Cool, applied changes.
4.
LogicalRepRelation *remoterel = &root->remoterel;
+
Oid partOid = RelationGetRelid(partrel);

Spurious line addition.

Fixed; I went over the code and couldn't find any others.
Attaching v5 of the patch, which reflects the review in this email, along
with a few minor test improvements.
Thanks,
Onder
Attachments:
v5_0001_use_index_on_subs_when_pub_rep_ident_full.patch
On Tuesday, August 9, 2022 12:59 AM Önder Kalacı <onderkalaci@gmail.com> wrote:
Attaching v5 of the patch which reflects the review on this email, also few
minor test improvements.
Hi,
Thank you for the updated patch.
FYI, I noticed that v5 causes a cfbot failure in [1].
Could you please fix it in the next version?
[19:44:38.420] execReplication.c: In function ‘RelationFindReplTupleByIndex’:
[19:44:38.420] execReplication.c:186:24: error: ‘eq’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[19:44:38.420] 186 | if (!indisunique && !tuples_equal(outslot, searchslot, eq))
[19:44:38.420] | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[19:44:38.420] cc1: all warnings being treated as errors
[1]: https://cirrus-ci.com/task/6544573026533376
Best Regards,
Takamichi Osumi
Hi,
FYI, I noticed that v5 causes cfbot failure in [1].
Could you please fix it in the next version ?
Thanks for letting me know!
[19:44:38.420] execReplication.c: In function
‘RelationFindReplTupleByIndex’:
[19:44:38.420] execReplication.c:186:24: error: ‘eq’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
[19:44:38.420] 186 | if (!indisunique && !tuples_equal(outslot,
searchslot, eq))
[19:44:38.420] |
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[19:44:38.420] cc1: all warnings being treated as errors
It is kind of interesting that the compiler cannot understand that `eq` is
only used when `!indisunique`. Anyway, I've now sent v6, where I avoid the
warning with a slight refactor.
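For reference, the warning pattern boils down to something like the following: the assignment and the use of `eq` are both guarded by `!indisunique`, which is correct, but proving that requires path-sensitive reasoning beyond what -Wmaybe-uninitialized attempts. A minimal self-contained C illustration (a hypothetical stand-in, not the actual execReplication.c code); initializing the variable at declaration, or restructuring so the assignment clearly dominates the use, silences the warning:

```c
#include <assert.h>
#include <stdbool.h>

/* eq stands in for the equality-function lookup in the real code. */
int
sketch_scan(bool indisunique, bool tuple_matches)
{
	int			eq = 0;			/* initialized up front: without this, gcc
								 * may report "eq may be used
								 * uninitialized" even though the guards
								 * below make that impossible */

	if (!indisunique)
		eq = 1;					/* real code: set up equality functions */

	if (!indisunique && eq && tuple_matches)
		return 1;				/* tuple treated as a match */
	return 0;
}
```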
Thanks,
Onder KALACI
Attachments:
v6_0001_use_index_on_subs_when_pub_rep_ident_full.patch
On Mon, Aug 8, 2022 at 9:29 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
Hi,
Thanks for the feedback, see my reply below.
It turns out that we already invalidate the relevant entries in LogicalRepRelMap/LogicalRepPartMap when "ANALYZE" (or VACUUM) updates any of the statistics in pg_class.
The call-stack for analyze is roughly:
do_analyze_rel()
-> vac_update_relstats()
-> heap_inplace_update()
-> if needs to apply any statistical change
-> CacheInvalidateHeapTuple()

Yeah, it appears that this will work but I see that we don't update
here for inherited stats, how does it work for such cases?

There, the expansion of the relation list to partitions happens one level above on the call stack. So, the call stack looks like the following:
autovacuum_do_vac_analyze() (or ExecVacuum)
-> vacuum()
-> expand_vacuum_rel()
-> rel_list=parent+children partitions
-> for rel in rel_list
->analyze_rel()
->do_analyze_rel
... (and the same call stack as above)

I also added one variation of a similar test for partitioned tables, which I earlier added for non-partitioned tables as well:
Added a test which triggers this behavior. The test is as follows:
- Create two indexes on the target, on column_a and column_b
- Initially load data such that the column_a has a high cardinality
- Show that we use the index on column_a on a child table
- Load more data such that the column_b has higher cardinality
- ANALYZE on the parent table
- Show that we use the index on column_b afterwards on the child table

My answer for the above assumes that your question is regarding what happens if you ANALYZE on a partitioned table. If your question is something different, please let me know.
I was talking about inheritance cases, something like:
create table tbl1 (a int);
create table tbl1_part1 (b int) inherits (tbl1);
create table tbl1_part2 (c int) inherits (tbl1);
What we do in such cases is documented as: "if the table being
analyzed has inheritance children, ANALYZE gathers two sets of
statistics: one on the rows of the parent table only, and a second
including rows of both the parent table and all of its children. This
second set of statistics is needed when planning queries that process
the inheritance tree as a whole. The child tables themselves are not
individually analyzed in this case."
Now, the point I was worried about was what if the changes in child
tables (*_part1, *_part2) are much more than in tbl1? In such cases,
we may not invalidate child rel entries, so how will logical
replication behave for updates/deletes on child tables? There may not
be any problem here but it is better to do some analysis of such cases
to see how it behaves.
BTW, do we want to consider partial indexes for the scan in this
context? I mean it may not have data of all rows so how that would be
usable?

Few other comments:
==================
1. Why is it a good idea to choose the index selected even for the
bitmap path (T_BitmapHeapScan or T_BitmapIndexScan)? We use index scan
during update/delete, so not sure how we can conclude to use index for
bitmap paths.

In our case, during update/delete we are searching for a single tuple on the target. And, it seems like using an index is probably going to be cheaper for finding the single tuple. In general, I thought we should use an index if the planner ever decides to use it with the given restrictions.
What about the case where the index has a lot of duplicate values? We
may need to retrieve multiple tuples in such cases.
Also, for the majority of the use-cases, I think we'd probably expect an index on a column with high cardinality -- hence use index scan. So, bitmap index scans are probably not going to be that much common.
You are probably right here but I don't think we can make such
assumptions. I think the safest way to avoid any regression here is to
choose an index when the planner selects an index scan. We can always
extend it later to bitmap scans if required. We can add a comment
indicating the same.
Still, I don't see a problem with using such indexes. Of course, it is possible that I might be missing something. Do you have any specific concerns in this area?
2. The index info is built even on insert, so workload, where there
are no updates/deletes or those are not published then this index
selection work will go waste. Will it be better to do it at first
update/delete? One can say that it is not worth the hassle as anyway
it will be built the first time we perform an operation on the
relation or after the relation gets invalidated.

With the current approach, the index (re)-calculation is coupled with (in)validation of the relevant cache entries. So, I'd argue that for the simplicity of the code we can afford this small overhead. According to my local measurements, especially for large tables, the index oid calculation is mostly insignificant compared to the rest of the steps. Does that sound OK to you?
If we think so, then
probably adding a comment could be useful.

Yes, that is useful if you are OK with the above, added.
*
+ /*
+ * For insert-only workloads, calculating the index is not necessary.
+ * As the calculation is not expensive, we are fine to do here (instead
+ * of during first update/delete processing).
+ */
I think here instead of talking about cost, we should mention that it
is quite an infrequent operation i.e performed only when we first time
performs an operation on the relation or after invalidation. This is
because I think the cost is relative.
*
+
+ /*
+ * Although currently it is not possible for planner to pick a
+ * partial index or indexes only on expressions,
It may be better to expand this comment by describing a bit why it is
not possible in our case. You might want to give the function
reference where it is decided.
--
With Regards,
Amit Kapila.
Hi,
I'm a little late to catch up with your comments, but here are my replies:
My answer for the above assumes that your question is regarding what
happens if you ANALYZE on a partitioned table. If your question is
something different, please let me know.

I was talking about inheritance cases, something like:
create table tbl1 (a int);
create table tbl1_part1 (b int) inherits (tbl1);
create table tbl1_part2 (c int) inherits (tbl1);

What we do in such cases is documented as: "if the table being
analyzed has inheritance children, ANALYZE gathers two sets of
statistics: one on the rows of the parent table only, and a second
including rows of both the parent table and all of its children. This
second set of statistics is needed when planning queries that process
the inheritance tree as a whole. The child tables themselves are not
individually analyzed in this case."
Oh, I haven't considered inherited tables. That seems right, the
statistics of the children are not updated when the parent is analyzed.
Now, the point I was worried about was what if the changes in child
tables (*_part1, *_part2) are much more than in tbl1? In such cases,
we may not invalidate child rel entries, so how will logical
replication behave for updates/deletes on child tables? There may not
be any problem here but it is better to do some analysis of such cases
to see how it behaves.
I also haven't observed any specific issues. In the end, when the user (or
autovacuum) does ANALYZE on the child, that is when the statistics are
updated for the child. Although I do not have much experience with
inherited tables, this sounds like the expected behavior?
I also pushed a test covering inherited tables. First, a basic test on the
parent. Then, show that updates on the parent can also use indexes of the
children. Also, after an ANALYZE on the child, we can re-calculate the
index and use the index with a higher cardinality column.
Also, for the majority of the use-cases, I think we'd probably expect an
index on a column with high cardinality -- hence use index scan. So, bitmap
index scans are probably not going to be that common.

You are probably right here but I don't think we can make such
assumptions. I think the safest way to avoid any regression here is to
choose an index when the planner selects an index scan. We can always
extend it later to bitmap scans if required. We can add a comment
indicating the same.
Alright, I got rid of the bitmap scans.
Though, it caused a few of the new tests to fail. I think because of the
data size/distribution, the planner picks bitmap scans. To make the tests
consistent and small, I set `enable_bitmapscan` to `off` for this new test
file. Does that sound OK to you? Or, should we change the tests to make
sure they genuinely use index scans?
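The agreed restriction can be sketched like this (self-contained C with made-up enum/names; the real check inspects the cheapest Path's node tag and only accepts a plain index scan):

```c
#include <assert.h>

/* Miniature of the relevant path node tags. */
typedef enum SketchPathType
{
	SKETCH_SEQSCAN,
	SKETCH_INDEXSCAN,
	SKETCH_BITMAPHEAPSCAN
} SketchPathType;

/* Return the index oid to use, or 0 (think InvalidOid) when the
 * cheapest path is anything other than a plain index scan; bitmap
 * scans and seqscans fall back to the sequential-scan apply path. */
int
sketch_usable_index_oid(SketchPathType cheapest, int index_oid)
{
	if (cheapest == SKETCH_INDEXSCAN)
		return index_oid;
	return 0;
}
```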
*
+ /*
+  * For insert-only workloads, calculating the index is not necessary.
+  * As the calculation is not expensive, we are fine to do here (instead
+  * of during first update/delete processing).
+  */

I think here instead of talking about cost, we should mention that it
is quite an infrequent operation i.e performed only when we first time
performs an operation on the relation or after invalidation. This is
because I think the cost is relative.
Changed, does that look better?
+
+ /*
+  * Although currently it is not possible for planner to pick a
+  * partial index or indexes only on expressions,

It may be better to expand this comment by describing a bit why it is
not possible in our case. You might want to give the function
reference where it is decided.

Makes sense, added some more information.
Thanks,
Onder
Attachments:
v7_0001_use_index_on_subs_when_pub_rep_ident_full.patch
On Sat, Aug 20, 2022 7:02 PM Önder Kalacı <onderkalaci@gmail.com> wrote:
Hi,
I'm a little late to catch up with your comments, but here are my replies:
Thanks for your patch. Here are some comments.
1.
In FilterOutNotSuitablePathsForReplIdentFull(), is "nonPartialIndexPathList" a
good name for the list? Indexes only on expressions are also filtered.
+static List *
+FilterOutNotSuitablePathsForReplIdentFull(List *pathlist)
+{
+ ListCell *lc;
+ List *nonPartialIndexPathList = NIL;
2.
+typedef struct LogicalRepPartMapEntry
+{
+ Oid partoid; /* LogicalRepPartMap's key */
+ LogicalRepRelMapEntry relmapentry;
+ Oid usableIndexOid; /* which index to use? (Invalid when no index
+ * used) */
+} LogicalRepPartMapEntry;
For partition tables, is it possible to use relmapentry->usableIndexOid to mark
which index to use? Which means we don't need to add "usableIndexOid" to
LogicalRepPartMapEntry.
3.
It looks we should change the comment for FindReplTupleInLocalRel() in this
patch.
/*
* Try to find a tuple received from the publication side (in 'remoteslot') in
* the corresponding local relation using either replica identity index,
* primary key or if needed, sequential scan.
*
* Local tuple, if found, is returned in '*localslot'.
*/
static bool
FindReplTupleInLocalRel(EState *estate, Relation localrel,
4.
@@ -2030,16 +2017,19 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
{
EState *estate = edata->estate;
Relation localrel = relinfo->ri_RelationDesc;
- LogicalRepRelation *remoterel = &edata->targetRel->remoterel;
+ LogicalRepRelMapEntry *targetRel = edata->targetRel;
+ LogicalRepRelation *remoterel = &targetRel->remoterel;
EPQState epqstate;
TupleTableSlot *localslot;
Do we need this change? I didn't see any place to use the variable targetRel
afterwards.
5.
+ if (!AttributeNumberIsValid(mainattno))
+ {
+ /*
+ * There are two cases to consider. First, if the index is a primary or
+ * unique key, we cannot have any indexes with expressions. So, at this
+ * point we are sure that the index we deal is not these.
+ */
+ Assert(RelationGetReplicaIndex(rel) != RelationGetRelid(idxrel) &&
+ RelationGetPrimaryKeyIndex(rel) != RelationGetRelid(idxrel));
+
+ /*
+ * For a non-primary/unique index with an expression, we are sure that
+ * the expression cannot be used for replication index search. The
+ * reason is that we create relevant index paths by providing column
+ * equalities. And, the planner does not pick expression indexes via
+ * column equality restrictions in the query.
+ */
+ continue;
+ }
Is it possible that it is a usable index with an expression? I think indexes
with an expression has been filtered in
FilterOutNotSuitablePathsForReplIdentFull(). If it can't be a usable index with
an expression, maybe we shouldn't use "continue" here.
6.
In the following case, I got a result which is different from HEAD, could you
please look into it?
-- publisher
CREATE TABLE test_replica_id_full (x int);
ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;
CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full;
-- subscriber
CREATE TABLE test_replica_id_full (x int, y int);
CREATE INDEX test_replica_id_full_idx ON test_replica_id_full(x,y);
CREATE SUBSCRIPTION tap_sub_rep_full_0 CONNECTION 'dbname=postgres port=5432' PUBLICATION tap_pub_rep_full;
-- publisher
INSERT INTO test_replica_id_full VALUES (1);
UPDATE test_replica_id_full SET x = x + 1 WHERE x = 1;
The data in subscriber:
on HEAD:
postgres=# select * from test_replica_id_full ;
x | y
---+---
2 |
(1 row)
After applying the patch:
postgres=# select * from test_replica_id_full ;
x | y
---+---
1 |
(1 row)
Regards,
Shi yu
Hi,
Thanks for the review!
1.
In FilterOutNotSuitablePathsForReplIdentFull(), is
"nonPartialIndexPathList" a good name for the list? Indexes only on
expressions are also filtered.

+static List *
+FilterOutNotSuitablePathsForReplIdentFull(List *pathlist)
+{
+	ListCell   *lc;
+	List	   *nonPartialIndexPathList = NIL;

Yes, true. Initially we only filtered out partial indexes. Now
changed to *suitableIndexList*, does that look right?
2.
+typedef struct LogicalRepPartMapEntry
+{
+	Oid			partoid;		/* LogicalRepPartMap's key */
+	LogicalRepRelMapEntry relmapentry;
+	Oid			usableIndexOid; /* which index to use? (Invalid when no index
+								 * used) */
+} LogicalRepPartMapEntry;

For partition tables, is it possible to use relmapentry->usableIndexOid to
mark which index to use? That way we don't need to add "usableIndexOid" to
LogicalRepPartMapEntry.
My intention was to make this explicit so that it is clear that partitions
can explicitly own indexes.
But I tried your suggested refactor, which looks good. So, I changed it.
Also, I realized that I do not have a test where the partition has an index
(not inherited from the parent), which I also added now.
3.
It looks like we should change the comment for FindReplTupleInLocalRel() in
this patch.

/*
 * Try to find a tuple received from the publication side (in 'remoteslot') in
 * the corresponding local relation using either replica identity index,
 * primary key or if needed, sequential scan.
 *
 * Local tuple, if found, is returned in '*localslot'.
 */
static bool
FindReplTupleInLocalRel(EState *estate, Relation localrel,
I made a small change, just adding "index". Do you expect a larger change?
4.

@@ -2030,16 +2017,19 @@ apply_handle_delete_internal(ApplyExecutionData *edata,
 {
 	EState	   *estate = edata->estate;
 	Relation	localrel = relinfo->ri_RelationDesc;
-	LogicalRepRelation *remoterel = &edata->targetRel->remoterel;
+	LogicalRepRelMapEntry *targetRel = edata->targetRel;
+	LogicalRepRelation *remoterel = &targetRel->remoterel;
 	EPQState	epqstate;
 	TupleTableSlot *localslot;

Do we need this change? I didn't see any place where the variable targetRel
is used afterwards.
Right, it seems unnecessary; changed it back.
5.

+	if (!AttributeNumberIsValid(mainattno))
+	{
+		/*
+		 * There are two cases to consider. First, if the index is a primary
+		 * or unique key, we cannot have any indexes with expressions. So, at
+		 * this point we are sure that the index we deal with is not one of
+		 * these.
+		 */
+		Assert(RelationGetReplicaIndex(rel) != RelationGetRelid(idxrel) &&
+			   RelationGetPrimaryKeyIndex(rel) != RelationGetRelid(idxrel));
+
+		/*
+		 * For a non-primary/unique index with an expression, we are sure that
+		 * the expression cannot be used for replication index search. The
+		 * reason is that we create relevant index paths by providing column
+		 * equalities. And, the planner does not pick expression indexes via
+		 * column equality restrictions in the query.
+		 */
+		continue;
+	}

Is it possible that it is a usable index with an expression? I think indexes
with an expression have been filtered in
FilterOutNotSuitablePathsForReplIdentFull(). If it can't be a usable index
with an expression, maybe we shouldn't use "continue" here.
Ok, I think there were some confusing comments in the code, which I have
updated. I also added one more explicit Assert to make the code a little
more readable.
We can support indexes involving expressions, but not indexes consisting
only of expressions. FilterOutNotSuitablePathsForReplIdentFull() filters
out the latter; see IndexOnlyOnExpression().
So, for example, if we have an index as below, we skip the expression while
building the index scan keys:

CREATE INDEX people_names ON people (firstname, lastname, (id || '_' || sub_id));

We could consider removing `continue`, but that would mean we should also
adjust the following code block to handle indexprs. To me, that seems like
an edge case to implement at this point, given that such an index is
probably not common. Do you think I should try to use the indexprs as well
while building the scan key?
I'm mostly trying to keep the complexity small. If you suggest this
limitation should be lifted, I can give it a shot. The limitation I am
leaving here amounts to a single sentence: *The index on the subscriber can
only use simple column references.*
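To make the skipping concrete, here is a conceptual sketch of the scan-key construction for the people_names example above (plain Python; only the column names come from the example, the data representation is illustrative, not the patch's code):

```python
# The index (firstname, lastname, (id || '_' || sub_id)) yields scan keys
# only for its simple column references; the expression column is skipped.
# None marks the expression column here (a hypothetical representation).
index_columns = ["firstname", "lastname", None]
remote_tuple = {"firstname": "Ada", "lastname": "Lovelace", "id": 1, "sub_id": 2}

scan_keys = [(col, remote_tuple[col]) for col in index_columns if col is not None]
# scan_keys covers firstname and lastname only; the remaining columns are
# verified afterwards on the fetched tuples, tuples_equal()-style.
```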
6. In the following case, I got a result which is different from HEAD;
could you please look into it?

-- publisher
CREATE TABLE test_replica_id_full (x int);
ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL;
CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full;

-- subscriber
CREATE TABLE test_replica_id_full (x int, y int);
CREATE INDEX test_replica_id_full_idx ON test_replica_id_full(x,y);
CREATE SUBSCRIPTION tap_sub_rep_full_0 CONNECTION 'dbname=postgres port=5432' PUBLICATION tap_pub_rep_full;

-- publisher
INSERT INTO test_replica_id_full VALUES (1);
UPDATE test_replica_id_full SET x = x + 1 WHERE x = 1;

The data in subscriber:

On HEAD:
postgres=# select * from test_replica_id_full ;
 x | y
---+---
 2 |
(1 row)

After applying the patch:
postgres=# select * from test_replica_id_full ;
 x | y
---+---
 1 |
(1 row)
Oops, good catch. It seems we forgot to have:

skey[scankey_attoff].sk_flags |= SK_SEARCHNULL;

On HEAD, the index used for this purpose could only be the primary key or a
unique key on NOT NULL columns. Now we do allow NULL values, and need to
search for them. Added that (and your test) to the updated patch.
As a semi-related note, tuples_equal() decides `true` for (NULL = NULL). I
have not changed that, and it seems right in this context. Do you see any
issues with that?
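For illustration, the (NULL = NULL) matching semantics described above, sketched in Python with `None` standing in for NULL (a conceptual model, not the actual C implementation):

```python
def tuples_equal(a, b):
    # Unlike SQL's `=` operator (where NULL = NULL yields NULL), the
    # matching treats two NULLs as equal, so a replicated tuple with
    # NULL columns can still be found on the subscriber.
    return len(a) == len(b) and all(
        (x is None and y is None) or (x is not None and y is not None and x == y)
        for x, y in zip(a, b))
```

Under these semantics, the reproduction case above can match the local (1, NULL) row against the replicated old tuple.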
Also, I realized that the functions in execReplication.c expect only btree
indexes, so I skipped the other index types as well. If that makes sense, I
can work on a follow-up patch after this one is merged, to remove some of
the limitations mentioned here.
Thanks,
Onder