Re: 回复:how to create index concurrently on partitioned table

Started by Justin Pryzbyover 5 years ago25 messages
#1Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
3 attachment(s)

On Mon, Jun 15, 2020 at 09:37:42PM +0900, Michael Paquier wrote:

On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:

As shown above, an error occurred while creating an index in the second partition.
It can be clearly seen that the index of the partitioned table is invalid
and the index of the first partition is normal, the second partition is invalid,
and the Third Partition index does not exist at all.

That's a problem. I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction. Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.

Note that the mentioned problem wasn't serious: there was missing index on
child table, therefor the parent index was invalid, as intended. However I
agree that it's not nice that the command can fail so easily and leave behind
some indexes created successfully and some failed some not created at all.

But I took your advice initially creating invalid inds.

On Tue, Jun 16, 2020 at 10:02:21AM +0900, Michael Paquier wrote:

CIC is an operation that exists while allowing read and writes to
still happen in parallel, so that's fine IMO if it takes time. Now it
is true that it may not scale well so we should be careful with the
approach taken. Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.

I'm not sure if by reindex you mean my old 0002. That's now 0001, so if it can
be simplified somehow, that's great..

That gave me the idea to layer CIC on top of Reindex, since I think it does
exactly what's needed.

--
Justin

Attachments:

v4-0001-Implement-REINDEX-of-partitioned-tables-indexes.patchtext/x-diff; charset=us-ascii
v4-0002-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-ascii
v4-0003-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-ascii
#2Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#1)

On Sat, Aug 08, 2020 at 01:37:44AM -0500, Justin Pryzby wrote:

That gave me the idea to layer CIC on top of Reindex, since I think it does
exactly what's needed.

For now, I would recommend to focus first on 0001 to add support for
partitioned tables and indexes to REINDEX. CIC is much more
complicated btw, but I am not entering in the details now.

-   /*
-    * This may be useful when implemented someday; but that day is not today.
-    * For now, avoid erroring out when called in a multi-table context
-    * (REINDEX SCHEMA) and happen to come across a partitioned table.  The
-    * partitions may be reindexed on their own anyway.
-    */
+   /* Avoid erroring out */
    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
This comment does not help, and actually this becomes incorrect as
reindex for this relkind becomes supported once 0001 is done.
+       case RELKIND_INDEX:
+           reindex_index(inhrelid, false, get_rel_persistence(inhrelid),
+                         options | REINDEXOPT_REPORT_PROGRESS);
+           break;
+       case RELKIND_RELATION:
+           (void) reindex_relation(inhrelid,
+                                     REINDEX_REL_PROCESS_TOAST |
+                                     REINDEX_REL_CHECK_CONSTRAINTS,
+                                     options | REINDEXOPT_REPORT_PROGRESS);
ReindexPartitionedRel() fails to consider the concurrent case here for
partition indexes and tables, as reindex_index()/reindex_relation()
are the APIs used in the non-concurrent case.  Once you consider the
concurrent case correctly, we also need to be careful with partitions
that have a temporary persistency (note that we don't allow partition
trees to mix persistency types, all partitions have to be temporary or
permanent).

I think that you are right to make the entry point to handle
partitioned index in ReindexIndex() and partitioned table in
ReindexTable(), but the structure of the patch should be different:
- The second portion of ReindexMultipleTables() should be moved into a
separate routine, taking in input a list of relation OIDs. This needs
to be extended a bit so as reindex_index() gets called for an index
relkind if the relpersistence is temporary or if we have a
non-concurrent reindex. The idea is that we finish with a single code
path able to work on a list of relations. And your patch adds more of
that as of ReindexPartitionedRel().
- We should *not* handle directly partitioned index and/or table in
ReindexRelationConcurrently() to not complicate the logic where we
gather all the indexes of a table/matview. So I think that the list
of partition indexes/tables to work on should be built directly in
ReindexIndex() and ReindexTable(), and then this should call the
second part of ReindexMultipleTables() refactored in the previous
point. This way, each partition index gets done individually in its
own transaction. For a partition table, all indexes of this partition
are rebuilt in the same set of transactions. For the concurrent case,
we have already reindex_concurrently_swap that it able to switch the
dependencies of two indexes within a partition tree, so we can rely on
that so as a failure in the middle of the operation never leaves the
a partition structure in an inconsistent state.
--
Michael

#3Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#2)
3 attachment(s)

Thanks for looking.

On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote:

exactly what's needed.

For now, I would recommend to focus first on 0001 to add support for
partitioned tables and indexes to REINDEX. CIC is much more
complicated btw, but I am not entering in the details now.

+ /* Avoid erroring out */
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
This comment does not help, and actually this becomes incorrect as
reindex for this relkind becomes supported once 0001 is done.

I made a minimal change to avoid forgetting to eventually change that part.

ReindexPartitionedRel() fails to consider the concurrent case here for
partition indexes and tables, as reindex_index()/reindex_relation()
are the APIs used in the non-concurrent case. Once you consider the
concurrent case correctly, we also need to be careful with partitions
that have a temporary persistency (note that we don't allow partition
trees to mix persistency types, all partitions have to be temporary or
permanent).

Fixed.

I think that you are right to make the entry point to handle
partitioned index in ReindexIndex() and partitioned table in
ReindexTable(), but the structure of the patch should be different:
- The second portion of ReindexMultipleTables() should be moved into a
separate routine, taking in input a list of relation OIDs. This needs
to be extended a bit so as reindex_index() gets called for an index
relkind if the relpersistence is temporary or if we have a
non-concurrent reindex. The idea is that we finish with a single code
path able to work on a list of relations. And your patch adds more of
that as of ReindexPartitionedRel().

It's a good idea.

- We should *not* handle directly partitioned index and/or table in
ReindexRelationConcurrently() to not complicate the logic where we
gather all the indexes of a table/matview. So I think that the list
of partition indexes/tables to work on should be built directly in
ReindexIndex() and ReindexTable(), and then this should call the
second part of ReindexMultipleTables() refactored in the previous
point.

I think I addressed these mostly as you intended.

This way, each partition index gets done individually in its
own transaction. For a partition table, all indexes of this partition
are rebuilt in the same set of transactions. For the concurrent case,
we have already reindex_concurrently_swap that it able to switch the
dependencies of two indexes within a partition tree, so we can rely on
that so as a failure in the middle of the operation never leaves the
a partition structure in an inconsistent state.

--
Justin

Attachments:

v5-0001-Implement-REINDEX-of-partitioned-tables-indexes.patchtext/x-diff; charset=us-ascii
v5-0002-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-ascii
v5-0003-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-ascii
#4Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#3)
1 attachment(s)

On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote:

On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote:

For now, I would recommend to focus first on 0001 to add support for
partitioned tables and indexes to REINDEX. CIC is much more
complicated btw, but I am not entering in the details now.

+ /* Avoid erroring out */
if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
This comment does not help, and actually this becomes incorrect as
reindex for this relkind becomes supported once 0001 is done.

I made a minimal change to avoid forgetting to eventually change
that part.

Why not changing it then? We already filter out per relkind in all
the code paths calling reindex_relation(), be it in indexcmds.c for
schema-level reindex or even tablecmds.c, so I have switched this part
to an elog().

- We should *not* handle directly partitioned index and/or table in
ReindexRelationConcurrently() to not complicate the logic where we
gather all the indexes of a table/matview. So I think that the list
of partition indexes/tables to work on should be built directly in
ReindexIndex() and ReindexTable(), and then this should call the
second part of ReindexMultipleTables() refactored in the previous
point.

I think I addressed these mostly as you intended.

Mostly. I have been hacking on this patch, and basically rewrote it
as the attached. The handling of the memory context used to keep the
list of partitions intact across transactions was rather clunky: the
context was not reset when we are done, and we would call more APIs
than necessary while switching to it, like find_all_inheritors() which
could do much more allocations. I have fixed that by minimizing the
areas where the private context is used, switching to it only when
saving a new OID in the list of partitions, or a session lock (see
below for this part).

While on it, I found that the test coverage was not enough, so I have
extended the set of tests to make sure any concurrent and
non-concurrent operation for partitioned tables and indexes change the
correct set of relfilenodes for each operation. I have written some
custom functions to minimize the duplication (the whole thing cannot
be grouped as those commands cannot run in a transaction block).

Speaking of which, the patch missed that REINDEX INDEX/TABLE should
not run in a transaction block when working on a partitioned
relation. And the documentation needs to be clear about the
limitation of each operation, so I have written more about all that.
The patch also has commented out areas with slashes or such, and I
have added some elog() and some asserts to make sure that we don't
cross any areas that should not work with partitioned relations.

While hacking on this patch, I have found an old bug in the REINDEX
logic: we build a list of relations to reindex in
ReindexMultipleTables() for schema and database reindexes, but it
happens that we don't recheck if the relations listed actually exists
or not, so dropping a relation during a large reindex can cause
sparse failures because of relations that cannot be found anymore. In
the case of this thread, the problem is different though (the proposed
patch was full of holes regarding that) and we need to use session
locks on the parent *table* partitions (not the indexes) to avoid any
issues within the first transaction building the list of relations to
work on, similarly to REINDEX CONCURRENTLY. So I fixed this problem
this way. For the schema and database cases, I think that we would
need to do something similar to VACUUM, aka have an extra code path to
skip relations not defined. I'll leave that for another thread.

One last thing. I think that the patch is in a rather good shape, but
there is one error message I am not happy with when running some
commands in a transaction block. Say, this sequence:
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE INDEX parent_index ON parent_tab (id);
BEGIN;
REINDEX INDEX parent_index; -- error
ERROR: 25001: REINDEX INDEX cannot run inside a transaction block
LOCATION: PreventInTransactionBlock, xact.c:3386

This error can be confusing, because we don't tell directly that the
relation involved here is partitioned, and REINDEX INDEX/TABLE are
fine when doing their stuff on non-partitions. For other code paths,
we have leveraged such errors to use the grammar specific to
partitions, for example "CREATE TABLE .. PARTITION OF" or such as
these don't cause translation issues, but we don't have a specific
syntax of REINDEX for partitioned relations, and I don't think that we
need more grammar just for that. The simplest idea I have here is to
just use an error callback to set an errcontext(), saying roughly:
"while reindexing partitioned table/index %s" while we go through
PreventInTransactionBlock(). I have done nothing about that yet but
adding an errcallback is simple enough. Perhaps somebody has a
different idea here?
--
Michael

Attachments:

reindex-part-v6.patchtext/x-diff; charset=us-ascii
#5Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#4)

Thanks for helping with this.

On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:

+++ b/src/backend/catalog/index.c
@@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
+		elog(ERROR, "unsupported relation kind for relation \"%s\"",
+			 RelationGetRelationName(rel));

I guess it should show the relkind(%c) in the message, like these:

src/backend/commands/tablecmds.c: elog(ERROR, "unexpected relkind: %d", (int) relkind);
src/backend/tcop/utility.c: elog(ERROR, "unexpected relkind \"%c\" on partition \"%s\"",

ISTM reindex_index is missing that, too:

8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99
+       if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+               elog(ERROR, "unsupported relation kind for index \"%s\"",
+                        RelationGetRelationName(iRel));
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
@@ -259,8 +263,12 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { IN
</para>
<para>
-   Reindexing partitioned tables or partitioned indexes is not supported.
-   Each individual partition can be reindexed separately instead.
+   Reindexing partitioned indexes or partitioned tables is supported
+   with respectively <command>REINDEX INDEX</command> or
+   <command>REINDEX TABLE</command>.

Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".

+ Each partition of the partitioned
+   relation defined is rebuilt in its own transaction.

=> Each partition of the specified partitioned relation is reindexed in a
separate transaction.

--
Justin

#6Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#5)
1 attachment(s)

On Wed, Aug 12, 2020 at 12:28:20AM -0500, Justin Pryzby wrote:

On Wed, Aug 12, 2020 at 01:54:38PM +0900, Michael Paquier wrote:

+++ b/src/backend/catalog/index.c
@@ -3661,20 +3662,12 @@ reindex_relation(Oid relid, int flags, int options)
+		elog(ERROR, "unsupported relation kind for relation \"%s\"",
+			 RelationGetRelationName(rel));

I guess it should show the relkind(%c) in the message, like these:

Sure, but I don't see much the point in adding the relkind here
knowing that we know which one it is.

ISTM reindex_index is missing that, too:

8b08f7d4820fd7a8ef6152a9dd8c6e3cb01e5f99
+       if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+               elog(ERROR, "unsupported relation kind for index \"%s\"",
+                        RelationGetRelationName(iRel));

The error string does not follow the usual project style either, so I
have updated both.

<para>
-   Reindexing partitioned tables or partitioned indexes is not supported.
-   Each individual partition can be reindexed separately instead.
+   Reindexing partitioned indexes or partitioned tables is supported
+   with respectively <command>REINDEX INDEX</command> or
+   <command>REINDEX TABLE</command>.

Should say "..with REINDEX INDEX or REINDEX TABLE, respectively".

+ Each partition of the partitioned
+   relation defined is rebuilt in its own transaction.

=> Each partition of the specified partitioned relation is reindexed in a
separate transaction.

Thanks, good idea.

I have been able to work more on this patch today, and finally added
an error context for the transaction block check, as that's cleaner.
In my manual tests, I have also bumped into a case that failed with
the original patch (where there were no session locks), and created
an isolation test based on it: drop of a partition leaf concurrently
to REINDEX done on the parent. One last thing I have spotted is that
we failed to discard properly foreign tables defined as leaves of a
partition tree, causing a reindex to fail, so reindex_partitions()
ought to just use RELKIND_HAS_STORAGE() to do its filtering work. I
am leaving this patch alone for a couple of days now, and I'll try to
come back to it after and potentially commit it. The attached has
been indented by the way.
--
Michael

Attachments:

reindex-part-v7.patchtext/x-diff; charset=us-ascii
#7Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#6)

On Wed, Aug 12, 2020 at 10:37:08PM +0900, Michael Paquier wrote:

I have been able to work more on this patch today, and finally added
an error context for the transaction block check, as that's cleaner.
In my manual tests, I have also bumped into a case that failed with
the original patch (where there were no session locks), and created
an isolation test based on it: drop of a partition leaf concurrently
to REINDEX done on the parent. One last thing I have spotted is that
we failed to discard properly foreign tables defined as leaves of a
partition tree, causing a reindex to fail, so reindex_partitions()
ought to just use RELKIND_HAS_STORAGE() to do its filtering work. I
am leaving this patch alone for a couple of days now, and I'll try to
come back to it after and potentially commit it. The attached has
been indented by the way.

I got to think more about the locking strategy used in this patch, and
I am afraid that we should fix the bug with REINDEX SCHEMA/DATABASE
first. What we have here is rather similar to a REINDEX SCHEMA with
all the partitions on the same schema, so it would be better to apply
the same set of assumptions and logic for the reindex of partitioned
relations as we do for the others. This makes the whole logic more
consistent, but it also reduces the surface of bugs. I have created a
separate thread for the problem, and posted a patch:
/messages/by-id/20200813043805.GE11663@paquier.xyz

Once this gets done, we should then be able to get rid of the extra
session locking taken when building the list of partitions, limiting
session locks to only be taken during the concurrent reindex of a
single partition (the table itself for a partition table, and the
parent table for a partition index), making the whole operation less
invasive.
--
Michael

#8Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#3)

On Sun, Aug 09, 2020 at 06:44:23PM -0500, Justin Pryzby wrote:

Thanks for looking.

The REINDEX patch is progressing its way, so I have looked a bit at
the part for CIC.

Visibly, the case of multiple partition layers is not handled
correctly. Here is a sequence that gets broken:
CREATE TABLE parent_tab (id int) PARTITION BY RANGE (id);
CREATE TABLE child_0_10 PARTITION OF parent_tab
FOR VALUES FROM (0) TO (10);
CREATE TABLE child_10_20 PARTITION OF parent_tab
FOR VALUES FROM (10) TO (20);
CREATE TABLE child_20_30 PARTITION OF parent_tab
FOR VALUES FROM (20) TO (30);
INSERT INTO parent_tab VALUES (generate_series(0,29));
CREATE TABLE child_30_40 PARTITION OF parent_tab
FOR VALUES FROM (30) TO (40)
PARTITION BY RANGE(id);
CREATE TABLE child_30_35 PARTITION OF child_30_40
FOR VALUES FROM (30) TO (35);
CREATE TABLE child_35_40 PARTITION OF child_30_40
FOR VALUES FROM (35) TO (40);
INSERT INTO parent_tab VALUES (generate_series(30,39));
CREATE INDEX CONCURRENTLY parent_index ON parent_tab (id);

This fails as follows:
ERROR: XX000: unrecognized node type: 2139062143
LOCATION: copyObjectImpl, copyfuncs.c:5718

And the problem visibly comes from some handling with the second level
of partitions, child_30_40 in my example above. Even with that, this
outlines a rather severe problem in the patch: index_set_state_flags()
flips indisvalid on/off using a non-transactional update (see the use
of heap_inplace_update), meaning that if we fail in the middle of the
operation we may finish with a partition index tree where some of the
indexes are valid and some of them are invalid. In order to make this
logic consistent, I am afraid that we will need to do two things:
- Change index_set_state_flags() so as it uses a transactional
update. That's something I would like to change for other reasons,
like making sure that the REPLICA IDENTITY of a parent relation is
correctly reset when dropping a replica index.
- Make all the indexes of the partition tree valid in the *same*
sub-transaction.
You can note that this case is different than a concurrent REINDEX,
because in this case we just do an in-place change between the old and
new index, meaning that even if there is a failure happening while
processing, we may have some invalid indexes, but there are still
valid indexes attached to the partition tree, at any time.

+     MemoryContext   oldcontext = MemoryContextSwitchTo(ind_context);
      PartitionDesc partdesc = RelationGetPartitionDesc(rel);
      int         nparts = partdesc->nparts;
+     char        *relname = pstrdup(RelationGetRelationName(rel));
Er, no.  We should not play with the relation cache calls in a private
memory context.  I think that this needs much more thoughts.
--
Michael
#9Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#7)
1 attachment(s)

On Fri, Aug 14, 2020 at 09:29:45AM +0900, Michael Paquier wrote:

Once this gets done, we should then be able to get rid of the extra
session locking taken when building the list of partitions, limiting
session locks to only be taken during the concurrent reindex of a
single partition (the table itself for a partition table, and the
parent table for a partition index), making the whole operation less
invasive.

The problem with dropped relations in REINDEX has been addressed by
1d65416, so I have gone through this patch again and simplified the
use of session locks, these being taken only when doing a REINDEX
CONCURRENTLY for a given partition. This part is in a rather
committable shape IMO, so I would like to get it done first, before
looking more at the other cases with CIC and CLUSTER. I am still
planning to go through it once again.
--
Michael

Attachments:

reindex-part-v8.patchtext/x-diff; charset=us-ascii
#10Anastasia Lubennikova
Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Michael Paquier (#9)

On 02.09.2020 04:39, Michael Paquier wrote:

The problem with dropped relations in REINDEX has been addressed by
1d65416, so I have gone through this patch again and simplified the
use of session locks, these being taken only when doing a REINDEX
CONCURRENTLY for a given partition. This part is in a rather
committable shape IMO, so I would like to get it done first, before
looking more at the other cases with CIC and CLUSTER. I am still
planning to go through it once again.
--
Michael

Thank you for advancing this work.

I was reviewing the previous version, but the review became outdated
before I sent it. Overall design is fine, but I see a bunch of things
that need to be fixed before commit.

First of all, this patch fails at cfbot:

indexcmds.c:2848:7: error: variable ‘parentoid’ set but not used
[-Werror=unused-but-set-variable]
Oid parentoid;^

It seems to be just a typo. With this minor fix the patch compiles and
passes tests.

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 75008eebde..f5b3c53a83 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2864,7 +2864,7 @@ reindex_partitions(Oid relid, int options, bool concurrent,
                 /* Save partition OID */
                 old_context = MemoryContextSwitchTo(reindex_context);
-               partitions = lappend_oid(partitions, partoid);
+               partitions = lappend_oid(partitions, parentoid);
                 MemoryContextSwitchTo(old_context);
         }

If this guessed fix is correct, I see the problem in the patch logic. In
reindex_partitions() we collect parent relations to pass them to
reindex_multiple_internal(). It implicitly changes the logic from
REINDEX INDEX to REINDEX RELATION, which is not the same, if table has
more than one index. For example, if I add one more index to a
partitioned table from a create_index.sql test:

create index idx on concur_reindex_part_0_2 using btree (c2);

and call

REINDEX INDEX CONCURRENTLY concur_reindex_part_index;

idx will be reindexed as well. I doubt that it is the desired behavior.

A few more random issues, that I noticed at first glance:

1) in create_index.sql

Are these two lines intentional checks that must fail? If so, I propose
to add a comment.

REINDEX TABLE concur_reindex_part_index;
REINDEX TABLE CONCURRENTLY concur_reindex_part_index;

A few lines around also look like they were copy-pasted and need a
second look.

2)  This part of ReindexRelationConcurrently() is misleading.

        case RELKIND_PARTITIONED_TABLE:
        case RELKIND_PARTITIONED_INDEX:
        default:
            /* Return error if type of relation is not supported */
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot reindex this type of relation concurrently")));

Maybe assertion is enough. It seems, that we should never reach this
code because both ReindexTable and ReindexIndex handle those relkinds
separately.  Which leads me to the next question.

3) Is there a reason, why reindex_partitions() is not inside
ReindexRelationConcurrently() ? I think that logically it belongs there.

4) I haven't tested multi-level partitions yet. In any case, it would be
good to document this behavior explicitly.

I need a bit more time to review this patch more thoroughly. Please,
wait for it, before committing.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#11Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Anastasia Lubennikova (#10)
1 attachment(s)

On Thu, Sep 03, 2020 at 10:02:58PM +0300, Anastasia Lubennikova wrote:

First of all, this patch fails at cfbot:

indexcmds.c:2848:7: error: variable ‘parentoid’ set but not used
[-Werror=unused-but-set-variable]
Oid parentoid;^

Missed this warning, thanks for pointing it out. This is an incorrect
rebase: this variable was used as a workaround to take a session lock
on the parent table to be reindexed, session lock logic existing to
prevent cache lookup errors when dropping some portions of the tree
concurrently (1d65416 as you already know). But we don't need that
anymore.

If this guessed fix is correct, I see the problem in the patch logic. In
reindex_partitions() we collect parent relations to pass them to
reindex_multiple_internal(). It implicitly changes the logic from REINDEX
INDEX to REINDEX RELATION, which is not the same, if table has more than one
index.

Incorrect guess here. parentoid refers to the parent table of an
index, so by saving its OID in the list of things to reindex, you
would finish by reindexing all the indexes of a partition. We need to
use partoid, as returned by find_all_inheritors() for all the members
of the partition tree (relid can be a partitioned index or partitioned
table in reindex_partitions).

1) in create_index.sql

Are these two lines intentional checks that must fail? If so, I propose to
add a comment.

REINDEX TABLE concur_reindex_part_index;
REINDEX TABLE CONCURRENTLY concur_reindex_part_index;

A few lines around also look like they were copy-pasted and need a second
look.

You can note some slight differences though. These are test cases for
REINDEX INDEX with tables, and REINDEX TABLE with indexes. What you
are quoting here is the part for indexes with REINDEX TABLE. And
there are already comments, see:
"-- REINDEX INDEX fails for partitioned tables"
"-- REINDEX TABLE fails for partitioned indexes"

Perhaps that's not enough, so I have added some more comments to
outline that these commands fail (8 commands in total).

2)  This part of ReindexRelationConcurrently() is misleading.

Maybe assertion is enough. It seems, that we should never reach this code
because both ReindexTable and ReindexIndex handle those relkinds
separately.  Which leads me to the next question.

Yes, we could use an assert, but I did not feel any strong need to
change that either for this patch.

3) Is there a reason, why reindex_partitions() is not inside
ReindexRelationConcurrently() ? I think that logically it belongs there.

Yes, it simplifies the build of the index list indexIds, as there is
no need to loop back into a different routine if working on a table or
a matview.

4) I haven't tested multi-level partitions yet. In any case, it would be
good to document this behavior explicitly.

Not sure what addition we could do here. The patch states that each
partition of the partitioned relation defined gets reindexed, which
implies that this handles multiple layers automatically.

I need a bit more time to review this patch more thoroughly. Please, wait
for it, before committing.

Glad to hear that, please take the time you need.
--
Michael

Attachments:

reindex-part-v9.patchtext/x-diff; charset=us-ascii
#12Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#11)
1 attachment(s)

On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote:

Glad to hear that, please take the time you need.

Attached is a rebased version to address the various conflicts after
844c05a.
--
Michael

Attachments:

reindex-part-v10.patchtext/x-diff; charset=us-ascii
#13Anastasia Lubennikova
Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Michael Paquier (#12)

On 07.09.2020 04:58, Michael Paquier wrote:

On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote:

Glad to hear that, please take the time you need.

Attached is a rebased version to address the various conflicts after
844c05a.
--
Michael

Thank you.

With the fix for the cycle in reindex_partitions() and new comments
added, I think this patch is ready for commit.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#14Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#8)
2 attachment(s)

Thanks for completing and pushing the REINDEX patch and others.

Here's a rebasified + fixed version of the others.

On Tue, Sep 01, 2020 at 02:51:58PM +0900, Michael Paquier wrote:

The REINDEX patch is progressing its way, so I have looked a bit at
the part for CIC.

Visibly, the case of multiple partition layers is not handled
correctly. Here is a sequence that gets broken:

..

This fails as follows:
ERROR: XX000: unrecognized node type: 2139062143
LOCATION: copyObjectImpl, copyfuncs.c:5718

Because copyObject needed to be called within a longlived context.

Also, my previous revision failed to implement your suggestion to first build
catalog entries with INVALID indexes and to then reindex them. Fixed.

--
Justin

Attachments:

v6-0001-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-ascii
v6-0002-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-ascii
#15Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#14)

On Mon, Sep 07, 2020 at 09:39:16PM -0500, Justin Pryzby wrote:

Also, my previous revision failed to implement your suggestion to first build
catalog entries with INVALID indexes and to then reindex them. Fixed.

-   childStmt->oldCreateSubid = InvalidSubTransactionId;
-   childStmt->oldFirstRelfilenodeSubid = childStmt->InvalidSubTransactionId;
+   // childStmt->oldCreateSubid = childStmt->InvalidSubTransactionId;
+   // childStmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
In the CIC patch, what is that about?  It is hard to follow what you
are trying to achieve here.
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+           CommandCounterIncrement();
+           index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
Anyway, this part of the logic is still not good here.  If we fail in
the middle here, we may run into problems with a single partition
index that has only a portion of its flags set.  As this gets called
for each partitioned index, it also means that we could still finish
in a state where we have only a partially-valid partition tree.  For
example, imagine a partition tree with 2 levels (as reported by
pg_partition_tree), then the following happens if an index is created
concurrently from the partitioned table of level 0:
1) indexes are created in level 2 first
2) partitioned table of level 1 is switched to be ready and valid
3) indexes of level 1 are created.
4) partitioned table of level 0 is switched to be ready and valid
If the command has a failure, say between 2 and 3, we would have as
result a command that has partially succeeded in creating a partition
tree, while the user was looking for an index for the full tree.  This
comes back to my previous points, where we should make
index_set_state_flags() first, and I have begun a separate thread
about that:
https://commitfest.postgresql.org/30/2725/

Second, it also means that the patch should really switch all the
indexes to be valid in one single transaction, and that we also need
more careful refactoring of DefineIndex().

I also had a quick look at the patch for CLUSTER, and it does a much
better job, still it has issues.

+   MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+
+   inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+   foreach(lc, inhoids)
The area where a private memory context is used should be minimized.
In this case, you just need to hold the context while saving the
relation and clustered index information in the list of RelToCluster
items.  As a whole, this case is more simple than CIC, so I'd like to
think that it would be good to work on that as next target.

Coming to my last point.. This thread has dealt since the beginning
with three different problems:
- REINDEX on partitioned relations.
- CLUSTER on partitioned relations.
- CIC on partitioned relations. (Should we also care about DROP INDEX
CONCURRENTLY as well?)

The first problem has been solved, not the two others yet. Do you
think that it could be possible to begin two new separate threads for
the remaining issues, with dedicated CF entries? We could also mark
the existing one as committed, retitled for REINDEX as a matter of
clarity. Also, please note that I am not sure if I will be able to
look more at this thread in this CF.
--
Michael

#16Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#15)

On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:

- CIC on partitioned relations. (Should we also care about DROP INDEX
CONCURRENTLY as well?)

Do you have any idea what you think that should look like for DROP INDEX
CONCURRENTLY ?

--
Justin

#17Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#16)

On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote:

On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:

- CIC on partitioned relations. (Should we also care about DROP INDEX
CONCURRENTLY as well?)

Do you have any idea what you think that should look like for DROP INDEX
CONCURRENTLY ?

Making the maintenance of the partition tree consistent to the user is
the critical part here, so my guess on this matter is:
1) Remove each index from the partition tree and mark the indexes as
invalid in the same transaction. This makes sure that after commit no
indexes would get used for scans, and the partition dependency tree
pis completely removed with the parent table. That's close to what we
do in index_concurrently_swap() except that we just want to remove the
dependencies with the partitions, and not just swap them of course.
2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction
should be fine as that prevents inserts.
3) Finish the index drop.

Step 2) and 3) could be completely done for each index as part of
index_drop(). The tricky part is to integrate 1) cleanly within the
existing dependency machinery while still knowing about the list of
partitions that can be removed. I think that this part is not that
straight-forward, but perhaps we could just make this logic part of
RemoveRelations() when listing all the objects to remove.
--
Michael

#18Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#17)
3 attachment(s)

On Sat, Sep 12, 2020 at 10:35:34AM +0900, Michael Paquier wrote:

On Fri, Sep 11, 2020 at 07:13:01PM -0500, Justin Pryzby wrote:

On Tue, Sep 08, 2020 at 01:31:05PM +0900, Michael Paquier wrote:

- CIC on partitioned relations. (Should we also care about DROP INDEX
CONCURRENTLY as well?)

Do you have any idea what you think that should look like for DROP INDEX
CONCURRENTLY ?

Making the maintenance of the partition tree consistent to the user is
the critical part here, so my guess on this matter is:
1) Remove each index from the partition tree and mark the indexes as
invalid in the same transaction. This makes sure that after commit no
indexes would get used for scans, and the partition dependency tree
pis completely removed with the parent table. That's close to what we
do in index_concurrently_swap() except that we just want to remove the
dependencies with the partitions, and not just swap them of course.
2) Switch each index to INDEX_DROP_SET_DEAD, one per transaction
should be fine as that prevents inserts.
3) Finish the index drop.

Step 2) and 3) could be completely done for each index as part of
index_drop(). The tricky part is to integrate 1) cleanly within the
existing dependency machinery while still knowing about the list of
partitions that can be removed. I think that this part is not that
straight-forward, but perhaps we could just make this logic part of
RemoveRelations() when listing all the objects to remove.

Thanks.

I see three implementation ideas..

1. I think your way has an issue that the dependencies are lost. If there's an
interruption, the user is maybe left with hundreds or thousands of detached
indexes to clean up. This is strange since there's actually no detach command
for indexes (but they're implicitly "attached" when a matching parent index is
created). A 2nd issue is that DETACH currently requires an exclusive lock (but
see Alvaro's WIP patch).

2. Maybe the easiest way is to mark all indexes invalid and then drop all
partitions (concurrently) and then the partitioned parent. If interrupted,
this would leave a parent index marked "invalid", and some child tables with no
indexes. I think this may be "ok". The same thing is possible if a concurrent
build is interrupted, right ?

3. I have a patch which changes index_drop() to "expand" a partitioned index into
its list of children. Each of these becomes a List:
| indexId, heapId, userIndexRelation, userHeapRelation, heaplocktag, heaprelid, indexrelid
The same process is followed as for a single index, but handling all partitions
at once in two transactions total. Arguably, this is bad since that function
currently takes a single Oid but would now ends up operating on a list of indexes.

Anyway, for now this is rebased on 83158f74d.

--
Justin

Attachments:

v7-0001-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-ascii
v7-0002-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-ascii
v7-0003-WIP-implement-DROP-INDEX-CONCURRENTLY-on-partitio.patchtext/x-diff; charset=us-ascii
#19Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#18)

On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:

Anyway, for now this is rebased on 83158f74d.

I have not thought yet about all the details of CIC and DIC and what
you said upthread, but I have gone through CLUSTER for now, as a
start. So, the goal of the patch, as I would define it, is to give a
way to CLUSTER to work on a partitioned table using a given
partitioned index. Hence, we would perform CLUSTER automatically from
the top-most parent for any partitions that have an index on the same
partition tree as the partitioned index defined in USING, switching
indisclustered automatically depending on the index used.

+CREATE TABLE clstrpart3 PARTITION OF clstrpart DEFAULT PARTITION BY RANGE(a);
+CREATE TABLE clstrpart33 PARTITION OF clstrpart3 DEFAULT;
 CREATE INDEX clstrpart_idx ON clstrpart (a);
-ALTER TABLE clstrpart CLUSTER ON clstrpart_idx
So..  For any testing of partitioned trees, we should be careful to
check if relfilenodes have been changed or not as part of an
operation, to see if the operation has actually done something.

From what I can see, attempting to use a CLUSTER on a top-most
partitioned table fails to work on child tables, but isn't the goal of
the patch to make sure that if we attempt to do the operation on a
partitioned table using a partitioned index, then the operation should
be done as well on the partitions with the partition index part of the
same partition tree as the parent partitioned index? If using CLUSTER
on a new partitioned index with USING, it seems to me that we may want
to check that indisclustered is correctly set for all the partitions
concerned in the regression tests. It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka no all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

+   MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+   inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+   MemoryContextSwitchTo(old_context);
Er, isn't that incorrect?  I would have assumed that what should be
saved in the context of cluster is the list of RelToCluster items.
And we should not do any lookup of the partitions in a different
context, because this may do allocations of things we don't really
need to keep around for the context dedicated to CLUSTER.  Isn't
NoLock unsafe here, even knowing that an exclusive lock is taken on
the parent?  It seems to me that at least schema changes should be
prevented with a ShareLock in the first transaction building the list
of elements to cluster.
+       /*
+        * We have a full list of direct and indirect children, so skip
+        * partitioned tables and just handle their children.
+        */
+       if (get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
+           continue;
It would be better to use RELKIND_HAS_STORAGE here.

All this stuff needs to be documented clearly.
--
Michael

#20Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#19)
3 attachment(s)

On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote:

start. So, the goal of the patch, as I would define it, is to give a
way to CLUSTER to work on a partitioned table using a given
partitioned index. Hence, we would perform CLUSTER automatically from
the top-most parent for any partitions that have an index on the same
partition tree as the partitioned index defined in USING, switching
indisclustered automatically depending on the index used.

I think that's right, except there's no need to look for a compatible
partitioned index: we just use the child index.

Also, if a partitioned index is clustered, when we clear indisclustered for
other indexes, should we also propogate that to their parent indexes, if any ?

From what I can see, attempting to use a CLUSTER on a top-most
partitioned table fails to work on child tables,

Oops - it looks like this patch never worked right, due to the RECHECK on
indisclustered. I think maybe I returned to the CIC patch and never finishing
with cluster.

It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka no all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

This should not happen, since a incomplete partitioned index is "invalid".

Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on
the parent? It seems to me that at least schema changes should be
prevented with a ShareLock in the first transaction building the list
of elements to cluster.

Thanks for noticing. I chose ShareUpdateExclusiveLock since it's
set-exclusive.

--
Justin

Attachments:

v8-0001-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-ascii
v8-0002-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-ascii
v8-0003-WIP-implement-DROP-INDEX-CONCURRENTLY-on-partitio.patchtext/x-diff; charset=us-ascii
#21Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#20)

On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:

Also, if a partitioned index is clustered, when we clear indisclustered for
other indexes, should we also propogate that to their parent indexes, if any ?

I am not sure what you mean here. Each partition's cluster runs in
its own individual transaction based on the patch you sent. Are you
suggesting to update indisclustered for the partitioned index of a
partitioned table and all its parent partitioned in the same
transaction, aka a transaction working on the partitioned table?
Doesn't that mean that if we have a partition tree with multiple
layers then we finish by doing multiple time the same operation for
the parents?

It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka no all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

This should not happen, since a incomplete partitioned index is "invalid".

Indeed, I did not know this property. I can see also that you have
added a test for this case, so that's good if we can rely on that. I
am still in the process of reviewing this patch, all this handling
around indisclustered makes it rather complex.
--
Michael

#22Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#21)

On Mon, Oct 05, 2020 at 05:46:27PM +0900, Michael Paquier wrote:

On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:

Also, if a partitioned index is clustered, when we clear indisclustered for
other indexes, should we also propogate that to their parent indexes, if any ?

I am not sure what you mean here. Each partition's cluster runs in
its own individual transaction based on the patch you sent. Are you
suggesting to update indisclustered for the partitioned index of a
partitioned table and all its parent partitioned in the same
transaction, aka a transaction working on the partitioned table?

No, I mean that if a partition is no longer clustered on some index, then its
parent isn't clustered on that indexes parent, either.

It means that we might do N catalog updates for a partition heirarchy that's N
levels deep. Normally, N=2, and we'd clear indisclustered for the index as
well as its parent. This is not essential, though.

It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka no all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

This should not happen, since a incomplete partitioned index is "invalid".

Indeed, I did not know this property.

I think that's something we can apply for CIC/DIC, too.
It's not essential to avoid leaving an "invalid" or partial index if
interrupted. It's only essential that a partial, partitioned index is not
"valid".

For DROP IND CONCURRENTLY, I wrote:

On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:

2. Maybe the easiest way is to mark all indexes invalid and then drop all
partitions (concurrently) and then the partitioned parent. If interrupted,
this would leave a parent index marked "invalid", and some child tables with no
indexes. I think this may be "ok". The same thing is possible if a concurrent
build is interrupted, right ?

--
Justin

#23Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#22)

On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote:

It means that we might do N catalog updates for a partition heirarchy that's N
levels deep. Normally, N=2, and we'd clear indisclustered for the index as
well as its parent. This is not essential, though.

Hmm. I got to think more about this one, and being able to ensure a
consistent state of indisclustered for partitioned tables and all
their partitions across multiple transactions at all times would be a
nice property, as we could easily track down if an operation has
failed in-flight. The problem here is that we are limited by
indisclustered being a boolean, so we may set indisclustered one way
or another on some partitions, or on some parent partitioned tables,
but we are not sure if what's down is actually clustered for the
leaves of a partitioned index, or not. Or maybe we even have an
inconsistent state, so this does not provide a good user experience.
The consistency of a partition tree is a key thing, and we have such
guarantees with REINDEX (with or without concurrently), so what I'd
like to think that what makes sense for indisclustered on a
partitioned index is to have the following set of guarantees, and
relying on indisvalid as being true iff an index partition tree is
complete on a given table partition tree is really important:
- If indisclustered is set to true for a partitioned index, then we
have the guarantee that all its partition and partitioned indexes have
indisclustered set to true, across all the layers down to the leaves.
- If indisclustered is set to false for a partitioned index, then we
have the guarantee that all its partition and partitioned indexes have
indisclustered set to false.

If we happen to attach a new partition to a partitioned table of such
a tree, I guess that it is then logic to have indisclustered set to
the same value as the partitioned index when the partition inherits an
index. So, it seems to me that with the current catalogs we are
limited by indisclustered as being a boolean to maintain some kind of
consistency across transactions of CLUSTER, as we would use one
transaction per leaf for the work. In order to make things better
here, we could switch indisclustered to a char, able to use three
states:
- 'e' or enabled, equivalent to indisclustered == true. There should
be only one index on a table with 'e' set at a given time.
- 'd' or disabled, equivalent to indisclustered == false.
- a new third state, for an equivalent of work-in-progress, let's call
it 'w', or whatever.

Then, when we begin a CLUSTER on a partitioned table, I would imagine
the following:
- Switch all the indexes selected to 'w' in a first transaction, and
do not reset the state of other indexes if there is one 'e'. For
CLUSTER without USING, we switch the existing 'e' to 'w' if there is
such an index. If there are no indexes select-able, issue an error.
If we find an index with 'w', we have a candidate from a previous
failed command, so this gets used. I don't think that this design
requires a new DDL either as we could reset all 'w' and 'e' to 'd' if
using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table. For
CLUSTER with USING, the partitioned index selected and its leaves are
switched to 'w', similarly.
- Then do the work for all the partitions, one partition per
transaction, keeping 'w'.
- In a last transaction, switch all the partitions from 'w' to 'e',
resetting any existing 'e'.

ALTER TABLE CLUSTER ON should also be a supported operation, where 'e'
gets switched for all the partitions in a single transaction. Of
course, this should not work for an invalid index. Even with such a
design, I got to wonder if there could be cases where a user does
*not* want to cluster the leaves of a partition tree with the same
index. If that were to happen, the partition tree may need a redesign
so making things work so as we force partitions to inherit what's
wanted for the partitioned table makes the most sense to me.

By the way, I mentioned that previously, but this thread got used for
REINDEX that is finished, and now we have a discussion going on with
CLUSTER. There is also stuff related to CIC and DIC. There is also
only one CF entry for all that. I really think that this work had
better be split into separate threads, with separate CF entries. Do
you mind if I change the contents of the CF app so as the existing
item is renamed for REINDEX, that got committed? Then we could create
a new entry for CIC/DIC (it may make sense to split these two as
well, but I am not if there are any overlaps between the APIs of the
two), and a new entry for CLUSTER, depending on the state of the work.
The original subject of the thread is also unrelated, this makes the
review process unnecessarily complicated, and that's really
confusing. Each discussion should go into its own, independent,
thread.
--
Michael

#24Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Paquier (#23)

On Tue, Oct 06, 2020 at 11:42:55AM +0900, Michael Paquier wrote:

On Mon, Oct 05, 2020 at 03:08:32PM -0500, Justin Pryzby wrote:

It means that we might do N catalog updates for a partition heirarchy that's N
levels deep. Normally, N=2, and we'd clear indisclustered for the index as
well as its parent. This is not essential, though.

Hmm. I got to think more about this one, and being able to ensure a
consistent state of indisclustered for partitioned tables and all
their partitions across multiple transactions at all times would be a
nice property, as we could easily track down if an operation has
failed in-flight. The problem here is that we are limited by
indisclustered being a boolean, so we may set indisclustered one way
or another on some partitions, or on some parent partitioned tables,
but we are not sure if what's down is actually clustered for the
leaves of a partitioned index, or not. Or maybe we even have an
inconsistent state, so this does not provide a good user experience.
The consistency of a partition tree is a key thing, and we have such
guarantees with REINDEX (with or without concurrently), so what I'd
like to think that what makes sense for indisclustered on a
partitioned index is to have the following set of guarantees, and
relying on indisvalid as being true iff an index partition tree is
complete on a given table partition tree is really important:
- If indisclustered is set to true for a partitioned index, then we
have the guarantee that all its partition and partitioned indexes have
indisclustered set to true, across all the layers down to the leaves.
- If indisclustered is set to false for a partitioned index, then we
have the guarantee that all its partition and partitioned indexes have
indisclustered set to false.

If we happen to attach a new partition to a partitioned table of such
a tree, I guess that it is then logic to have indisclustered set to
the same value as the partitioned index when the partition inherits an
index. So, it seems to me that with the current catalogs we are
limited by indisclustered as being a boolean to maintain some kind of
consistency across transactions of CLUSTER, as we would use one
transaction per leaf for the work. In order to make things better
here, we could switch indisclustered to a char, able to use three
states:
- 'e' or enabled, equivalent to indisclustered == true. There should
be only one index on a table with 'e' set at a given time.
- 'd' or disabled, equivalent to indisclustered == false.
- a new third state, for an equivalent of work-in-progress, let's call
it 'w', or whatever.

Then, when we begin a CLUSTER on a partitioned table, I would imagine
the following:
- Switch all the indexes selected to 'w' in a first transaction, and
do not reset the state of other indexes if there is one 'e'. For
CLUSTER without USING, we switch the existing 'e' to 'w' if there is
such an index. If there are no indexes select-able, issue an error.
If we find an index with 'w', we have a candidate from a previous
failed command, so this gets used. I don't think that this design
requires a new DDL either as we could reset all 'w' and 'e' to 'd' if
using ALTER TABLE SET WITHOUT CLUSTER on a partitioned table. For
CLUSTER with USING, the partitioned index selected and its leaves are
switched to 'w', similarly.
- Then do the work for all the partitions, one partition per
transaction, keeping 'w'.
- In a last transaction, switch all the partitions from 'w' to 'e',
resetting any existing 'e'.

Honestly, I think you're over-thinking and over-engineering indisclustered.

If "clusteredness" was something we offered to maintain across DML, I think
that might be important to provide stronger guarantees. As it is now, I don't
think this patch is worth changing the catalog definition.

ALTER TABLE CLUSTER ON should also be a supported operation, where 'e'
gets switched for all the partitions in a single transaction. Of
course, this should not work for an invalid index. Even with such a
design, I got to wonder if there could be cases where a user does
*not* want to cluster the leaves of a partition tree with the same
index. If that were to happen, the partition tree may need a redesign
so making things work so as we force partitions to inherit what's
wanted for the partitioned table makes the most sense to me.

I wondered the same thing: should clustering a parent index cause the the child
indexes to be marked as clustered ? Or should it be possible for an
intermediate child to say (marked as) clustered on some other index. I don't
have strong feelings, but the patch currently sets indisclustered as a natural
consequence of the implementation, so I tentatively think that's what's right.
After all, CLUSTER sets indisclustered without having to also say
"ALTER..CLUSTER ON".

This is relevant to the question I raised about unsetting indisclustered for
each indexes *parent* when clustering on a different index.

I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition
just because a different partitioned index was set clustered. We'd clear that,
like always, and then (in my proposal) also clear its parents "indisclustered".
I still don't think that's essential, though.

By the way, I mentioned that previously, but this thread got used for
REINDEX that is finished, and now we have a discussion going on with
CLUSTER. There is also stuff related to CIC and DIC. There is also
only one CF entry for all that. I really think that this work had
better be split into separate threads, with separate CF entries. Do
you mind if I change the contents of the CF app so as the existing
item is renamed for REINDEX, that got committed? Then we could create
a new entry for CIC/DIC (it may make sense to split these two as
well, but I am not if there are any overlaps between the APIs of the
two), and a new entry for CLUSTER, depending on the state of the work.
The original subject of the thread is also unrelated, this makes the
review process unnecessarily complicated, and that's really
confusing. Each discussion should go into its own, independent,
thread.

I didn't think it's worth the overhead of closing and opening more CFs.
But I don't mind.

--
Justin

#25Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Justin Pryzby (#24)

On Mon, Oct 05, 2020 at 10:07:33PM -0500, Justin Pryzby wrote:

Honestly, I think you're over-thinking and over-engineering indisclustered.

If "clusteredness" was something we offered to maintain across DML, I think
that might be important to provide stronger guarantees. As it is now, I don't
think this patch is worth changing the catalog definition.

Well, this use case is new because we are discussing the relationship
of indisclustered across multiple transactions for multiple indexes,
so I'd rather have this discussion than not, and I have learnt
the hard way with REINDEX that we should care a lot about the
consistency of partition trees at any step of the operation. Let's
imagine a simple example here, take this partition tree: p (parent),
and two partitions p1 and p2. p has two partitioned indexes i and j,
indexes also present in p1 and p2 as i1, i2, j1 and j2. Let's assume
that the user has done a CLUSTER on p USING i that completes, meaning
that i, i1 and i2 have indisclustered set. Now let's assume that the
user does a CLUSTER on p USING j this time, and that this command
fails while processing p2, meaning that indisclustered is set for j1,
i2, and perhaps i or j depending on what the patch does. Per the
latest arguments, j would be the one set to indisclustered. From this
inconsistent state comes a couple of interesting things:
- A database-wide CLUSTER would finish by using j1 and i2 for the
operation on the partitions, while the intention was to use j2 for the
second partition as the previous command failed.
- With CLUSTER p, without USING. Logically, I would assume that we
would rely on the value of indisclustered as of j, meaning that we
would *enforce* p2 to use j2. But it could also be seen as incorrect
by the user because we would not use the index originally marked as
such.

So keeping this consistent has the advantage to have clear rules
here.

I think it would be strange if we refused "ALTER..CLUSTER ON" for a partition
just because a different partitioned index was set clustered. We'd clear that,
like always, and then (in my proposal) also clear its parents "indisclustered".
I still don't think that's essential, though.

Why? Blocking a partition, which may be itself partitioned, to switch
to a different index if its partitioned parent uses something else
sounds kind of logic to me, at the end, because the user originally
intended to use CLUSTER with a specific index on this tree. So I
would say that the partitioned table takes priority, and this should
be released with a WITHOUT CLUSTER from the partitioned table.

I didn't think it's worth the overhead of closing and opening more CFs.
But I don't mind.

Thanks, I'll do some cleanup.
--
Michael