how to create index concurrently on paritioned table
Hi hackers,
Partitioning is necessary for very large tables.
However, I found that postgresql does not support create index concurrently on partitioned tables.
The document show that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
This is undoubtedly a complex operation for DBA, especially when there are many partitions.
Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
What are the difficulties of this function?
If I want to implement it, what should I pay attention?
Sincerely look forward to your reply.
Regards & Thanks Adger
On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
Partitioning is necessary for very large tables.
However, I found that postgresql does not support create index concurrently on partitioned tables.
The document show that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
This is undoubtedly a complex operation for DBA, especially when there are many partitions.
Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
What are the difficulties of this function?
If I want to implement it, what should I pay attention?
Maybe I'm wrong, but I don't think there's any known difficulty - just that
nobody did it yet. You should pay attention to what happens on error, but
hopefully you wouldn't need to add much code and can rely on existing code to
paths to handle that right.
I think you'd look at the commits and code implementing indexes on partitioned
tables and CREATE INDEX CONCURRENTLY. And maybe any following commits with
fixes.
You'd first loop around all children (recursively if there are partitions which
are themselves partitioned) and create indexes concurrently.
--
Justin
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
Partitioning is necessary for very large tables.
However, I found that postgresql does not support create index concurrently on partitioned tables.
The document show that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
This is undoubtedly a complex operation for DBA, especially when there are many partitions.Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
What are the difficulties of this function?
If I want to implement it, what should I pay attention?Maybe I'm wrong, but I don't think there's any known difficulty - just that
nobody did it yet.
I said that but I was actually thinking about the code for "REINDEX
CONCURRENTLY" (which should also handle partitioned tables).
I looked at CIC now and came up with the attached. All that's needed to allow
this case is to close the relation before recursing to partitions - it needs to
be closed before calling CommitTransactionCommand(). There's probably a better
way to write this, but I can't see that there's anything complicated about
handling partitioned tables.
--
Justin
Attachments:
v1-0001-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-asciiDownload+31-33
On Sun, Jun 07, 2020 at 01:04:48PM -0500, Justin Pryzby wrote:
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
Partitioning is necessary for very large tables. However, I found that
postgresql does not support create index concurrently on partitioned
tables. The document show that we need to create an index on each
partition individually and then finally create the partitioned index
non-concurrently. This is undoubtedly a complex operation for DBA,
especially when there are many partitions.
I added functionality for C-I-C, REINDEX-CONCURRENTLY, and CLUSTER of
partitioned tables. We already recursively handle VACUUM and ANALYZE since
v10.
And added here:
https://commitfest.postgresql.org/28/2584/
Adger, if you're familiar with compilation and patching, do you want to try the
patch ?
Note, you could do this now using psql like:
SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM pg_partition_ancestors() AS a;
\gexec
--
Justin
Attachments:
v2-0001-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-asciiDownload+52-41
v2-0002-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-asciiDownload+119-41
v2-0003-Implement-REINDEX-of-partitioned-tables-indexes.patchtext/x-diff; charset=us-asciiDownload+89-65
On Thu, Jun 11, 2020 at 10:35:02AM -0500, Justin Pryzby wrote:
Note, you could do this now using psql like:
SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM pg_partition_ancestors() AS a;
\gexec
I have skimmed quickly through the patch set, and something has caught
my attention.
drop table idxpart; --- Some unsupported features +-- CIC on partitioned table create table idxpart (a int, b int, c text) partition by range (a); create table idxpart1 partition of idxpart for values from (0) to (10); create index concurrently on idxpart (a); -ERROR: cannot create index on partitioned table "idxpart" concurrently +\d idxpart1
When it comes to test behaviors specific to partitioning, there are in
my experience three things to be careful about and stress in the tests:
- Use at least two layers of partitioning.
- Include into the partition tree a partition that has no leaf
partitions.
- Test the commands on the top-most parent, a member in the middle of
the partition tree, the partition with no leaves, and one leaf, making
sure that relfilenode changes where it should and that partition trees
remain intact (you can use pg_partition_tree() for that.)
That's to say that the amount of regression tests added here is not
sufficient in my opinion.
--
Michael
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
Partitioning is necessary for very large tables.
However, I found that postgresql does not support create index concurrently on partitioned tables.
The document show that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently.
This is undoubtedly a complex operation for DBA, especially when there are many partitions.Therefore, I wonder why pg does not support concurrent index creation on partitioned tables?
What are the difficulties of this function?
If I want to implement it, what should I pay attention?Maybe I'm wrong, but I don't think there's any known difficulty - just that
nobody did it yet.
I said that but I was actually thinking about the code for "REINDEX
CONCURRENTLY" (which should also handle partitioned tables).
I looked at CIC now and came up with the attached. All that's needed to allow
this case is to close the relation before recursing to partitions - it needs to
be closed before calling CommitTransactionCommand(). There's probably a better
way to write this, but I can't see that there's anything complicated about
handling partitioned tables.
Hi, Justin Pryzby
I'm so sorry about getting back late.
Thank you very much for helping me consider this issue.
I compiled the patch v1 you provided. And I patch v2-001 again to enter postgresql.
I got a coredump that was easy to reproduce. As follows:
#0 PopActiveSnapshot () at snapmgr.c:822
#1 0x00000000005ca687 in DefineIndex (relationId=relationId@entry=16400,
stmt=stmt@entry=0x1aa5e28, indexRelationId=16408, indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=16406,
parentConstraintId=0, is_alter_table=is_alter_table@entry=false,
check_rights=true, check_not_in_use=true, skip_build=false, quiet=false)
at indexcmds.c:1426
#2 0x00000000005ca5ab in DefineIndex (relationId=relationId@entry=16384,
stmt=stmt@entry=0x1b35278, indexRelationId=16406, indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=0,
parentConstraintId=parentConstraintId@entry=0, is_alter_table=
is_alter_table@entry=false, check_rights=true, check_not_in_use=true,
skip_build=false, quiet=false) at indexcmds.c:1329
#3 0x000000000076bf80 in ProcessUtilitySlow (pstate=pstate@entry=0x1b350c8,
pstmt=pstmt@entry=0x1a2bf40,
queryString=queryString@entry=0x1a2b2c8 "create index CONCURRENTLY
idxpart_a_idx on idxpart (a);", context=context@entry=PROCESS_UTILITY_TOPLEVEL,
params=params@entry=0x0,
queryEnv=queryEnv@entry=0x0, qc=0x7ffc86cc7630, dest=0x1a2c200) at
utility.c:1474
#4 0x000000000076afeb in standard_ProcessUtility (pstmt=0x1a2bf40,
queryString=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x1a2c200, qc=0x7ffc86cc7630) at utility.c:1069
#5 0x0000000000768992 in PortalRunUtility (portal=0x1a8d1f8, pstmt=0x1a2bf40,
isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>,
qc=0x7ffc86cc7630) at pquery.c:1157
#6 0x00000000007693f3 in PortalRunMulti (portal=portal@entry=0x1a8d1f8,
isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x1a2c200,
altdest=altdest@entry=0x1a2c200, qc=qc@entry=0x7ffc86cc7630) at pquery.c:1310
#7 0x0000000000769ed3 in PortalRun (portal=portal@entry=0x1a8d1f8, count=count
@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once
@entry=true, dest=dest@entry=0x1a2c200,
altdest=altdest@entry=0x1a2c200, qc=0x7ffc86cc7630) at pquery.c:779
#8 0x0000000000765b06 in exec_simple_query (query_string=0x1a2b2c8 "create
index CONCURRENTLY idxpart_a_idx on idxpart (a);") at postgres.c:1239
#9 0x0000000000767de5 in PostgresMain (argc=<optimized out>, argv=argv@entry
=0x1a552c8, dbname=<optimized out>, username=<optimized out>) at postgres.c:4315
#10 0x00000000006f2b23 in BackendRun (port=0x1a4d1e0, port=0x1a4d1e0) at postmaster.c:4523
#11 BackendStartup (port=0x1a4d1e0) at postmaster.c:4215
#12 ServerLoop () at postmaster.c:1727
#13 0x00000000006f3a1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1a25ea0)
at postmaster.c:1400
#14 0x00000000004857f9 in main (argc=3, argv=0x1a25ea0) at main.c:210
You can re-produce it like this:
```
create table idxpart (a int, b int, c text) partition by range (a);
create table idxpart1 partition of idxpart for values from (0) to (10);
create table idxpart2 partition of idxpart for values from (10) to (20);
create index CONCURRENTLY idxpart_a_idx on idxpart (a);
````
I have been trying to get familiar with the source code of create index.
Can you solve this bug first? I will try my best to implement CIC with you.
Next, I will read your patchs v2-002 and v2-003.
Thank you very much,
Regards, Adger
Hi Justin,
Maybe I'm wrong, but I don't think there's any known difficulty - just that> nobody did it yet. You should pay attention to what happens on error, but
hopefully you wouldn't need to add much code and can rely on existing code to
paths to handle that right.
yes, I am trying to learn the code of index definition.
I think you'd look at the commits and code implementing indexes on partitioned
tables and CREATE INDEX CONCURRENTLY. And maybe any following commits with
fixes.
You'd first loop around all children (recursively if there are partitions which
are themselves partitioned) and create indexes concurrently.
As we all know, CIC has three transactions. If we recursively in n partitioned tables,
it will become 3N transactions. If an error occurs in these transactions, we have too many things to deal...
If an error occurs when an index is created in one of the partitions,
what should we do with our new index?
Thank you very much,
Regards, Adger
On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
As we all know, CIC has three transactions. If we recursively in n partitioned tables,
it will become 3N transactions. If an error occurs in these transactions, we have too many things to deal...If an error occurs when an index is created in one of the partitions,
what should we do with our new index?
My (tentative) understanding is that these types of things should use a
"subtransaction" internally.. So if the toplevel transaction rolls back, its
changes are lost. In some cases, it might be desirable to not roll back, in
which case the user(client) should first create indexes (concurrently if
needed) on every child, and then later create index on parent (that has the
advtantage of working on older servers, too).
postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG: building index "t1_i_idx" on table "t1" with request for 1 parallel worker
DEBUG: index "t1_i_idx" can safely use deduplication
DEBUG: creating and filling new WAL file
DEBUG: done creating and filling new WAL file
DEBUG: creating and filling new WAL file
DEBUG: done creating and filling new WAL file
DEBUG: building index "t2_i_idx" on table "t2" with request for 1 parallel worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR: canceling statement due to user request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT: CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL: terminating connection due to administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT: CREATE INDEX ON t(i);
Cancel request sent
If the index creation is interrupted at this point, no indexes will exist.
On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
I looked at CIC now and came up with the attached. All that's needed to allow
this case is to close the relation before recursing to partitions - it needs to
be closed before calling CommitTransactionCommand(). There's probably a better
way to write this, but I can't see that there's anything complicated about
handling partitioned tables.I'm so sorry about getting back late.
Thank you very much for helping me consider this issue.
I compiled the patch v1 you provided. And I patch v2-001 again to enter postgresql.
I got a coredump that was easy to reproduce. As follows:
I have been trying to get familiar with the source code of create index.
Can you solve this bug first? I will try my best to implement CIC with you.
Next, I will read your patchs v2-002 and v2-003.
Thanks, fixed.
On Fri, Jun 12, 2020 at 04:20:17PM +0900, Michael Paquier wrote:
When it comes to test behaviors specific to partitioning, there are in
my experience three things to be careful about and stress in the tests:
- Use at least two layers of partitioning.
- Include into the partition tree a partition that has no leaf
partitions.
- Test the commands on the top-most parent, a member in the middle of
the partition tree, the partition with no leaves, and one leaf, making
sure that relfilenode changes where it should and that partition trees
remain intact (you can use pg_partition_tree() for that.)
Added, thanks for looking.
--
Justin
Attachments:
v3-0001-Allow-CREATE-INDEX-CONCURRENTLY-on-partitioned-ta.patchtext/x-diff; charset=us-asciiDownload+71-38
v3-0002-Implement-REINDEX-of-partitioned-tables-indexes.patchtext/x-diff; charset=us-asciiDownload+106-65
v3-0003-Implement-CLUSTER-of-partitioned-table.patchtext/x-diff; charset=us-asciiDownload+133-41
Import Notes
Reply to msg id not found: 3c2edeb5-e216-4411-9a74-f27fcc0f6793.adger.lj@alibaba-inc.com20200612072017.GE3362@paquier.xyz | Resolved by subject fallback
My (tentative) understanding is that these types of things should use a
"subtransaction" internally.. So if the toplevel transaction rolls back, its
changes are lost. In some cases, it might be desirable to not roll back, in
which case the user(client) should first create indexes (concurrently if
needed) on every child, and then later create index on parent (that has the
advtantage of working on older servers, too).
Hi Justin,
I have a case here, you see if it meets your expectations.
`````
postgres=# CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE
postgres=# CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (25);
CREATE TABLE
postgres=# CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (25) TO (50);
CREATE TABLE
postgres=# CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (50) TO (75);
CREATE TABLE
postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 74) i;
INSERT 0 75
postgres=# insert into prt1 values (26,1,'FM0026');
INSERT 0 1
postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);
ERROR: could not create unique index "prt1_p2_a_idx"
DETAIL: Key (a)=(26) is duplicated.
postgres=# \d+ prt1
Partitioned table "public.prt1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition key: RANGE (a)
Indexes:
"idexpart_cic" UNIQUE, btree (a) INVALID
Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),
prt1_p2 FOR VALUES FROM (25) TO (50),
prt1_p3 FOR VALUES FROM (50) TO (75)
postgres=# \d+ prt1_p1
Table "public.prt1_p1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (0) TO (25)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))
Indexes:
"prt1_p1_a_idx" UNIQUE, btree (a)
Access method: heap
postgres=# \d+ prt1_p2
Table "public.prt1_p2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (25) TO (50)
Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))
Indexes:
"prt1_p2_a_idx" UNIQUE, btree (a) INVALID
Access method: heap
postgres=# \d+ prt1_p3
Table "public.prt1_p3"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (50) TO (75)
Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))
Access method: heap
```````
As shown above, an error occurred while creating an index in the second partition.
It can be clearly seen that the index of the partitioned table is invalid
and the index of the first partition is normal, the second partition is invalid,
and the Third Partition index does not exist at all.
But we do the following tests again:
```
postgres=# truncate table prt1;
TRUNCATE TABLE
postgres=# INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM0000') FROM generate_series(0, 74) i;
INSERT 0 75
postgres=# insert into prt1 values (51,1,'FM0051');
INSERT 0 1
postgres=# drop index idexpart_cic;
DROP INDEX
postgres=# create unique index CONCURRENTLY idexpart_cic on prt1 (a);
ERROR: could not create unique index "prt1_p3_a_idx"
DETAIL: Key (a)=(51) is duplicated.
postgres=# \d+ prt1
Partitioned table "public.prt1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition key: RANGE (a)
Indexes:
"idexpart_cic" UNIQUE, btree (a) INVALID
Partitions: prt1_p1 FOR VALUES FROM (0) TO (25),
prt1_p2 FOR VALUES FROM (25) TO (50),
prt1_p3 FOR VALUES FROM (50) TO (75)
postgres=# \d+ prt1_p1
Table "public.prt1_p1"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (0) TO (25)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 25))
Indexes:
"prt1_p1_a_idx" UNIQUE, btree (a)
Access method: heap
postgres=# \d+ prt1_p2
Table "public.prt1_p2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (25) TO (50)
Partition constraint: ((a IS NOT NULL) AND (a >= 25) AND (a < 50))
Indexes:
"prt1_p2_a_idx" UNIQUE, btree (a)
Access method: heap
postgres=# \d+ prt1_p3
Table "public.prt1_p3"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-------------------+-----------+----------+---------+----------+--------------+-------------
a | integer | | | | plain | |
b | integer | | | | plain | |
c | character varying | | | | extended | |
Partition of: prt1 FOR VALUES FROM (50) TO (75)
Partition constraint: ((a IS NOT NULL) AND (a >= 50) AND (a < 75))
Access method: heap
```
Now we can see that the first two partitions have indexes,
but the third partition has no indexes due to an error.
Therefore, in our first case, it should not be what we expected that the third partition has no index.
That is to say, when our CIC goes wrong, either roll back all or go down instead of stopping in the middle.
This is my shallow opinion, please take it as your reference.
Thank you very much,
Regards, Adger
------------------------------------------------------------------
发件人:Justin Pryzby <pryzby@telsasoft.com>
发送时间:2020年6月13日(星期六) 02:15
收件人:Michael Paquier <michael@paquier.xyz>; 李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: how to create index concurrently on partitioned table
On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
As we all know, CIC has three transactions. If we recursively in n partitioned tables,
it will become 3N transactions. If an error occurs in these transactions, we have too many things to deal...If an error occurs when an index is created in one of the partitions,
what should we do with our new index?
My (tentative) understanding is that these types of things should use a
"subtransaction" internally.. So if the toplevel transaction rolls back, its
changes are lost. In some cases, it might be desirable to not roll back, in
which case the user(client) should first create indexes (concurrently if
needed) on every child, and then later create index on parent (that has the
advtantage of working on older servers, too).
postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG: building index "t1_i_idx" on table "t1" with request for 1 parallel worker
DEBUG: index "t1_i_idx" can safely use deduplication
DEBUG: creating and filling new WAL file
DEBUG: done creating and filling new WAL file
DEBUG: creating and filling new WAL file
DEBUG: done creating and filling new WAL file
DEBUG: building index "t2_i_idx" on table "t2" with request for 1 parallel worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR: canceling statement due to user request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT: CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL: terminating connection due to administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT: CREATE INDEX ON t(i);
Cancel request sent
If the index creation is interrupted at this point, no indexes will exist.
On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
I looked at CIC now and came up with the attached. All that's needed to allow
this case is to close the relation before recursing to partitions - it needs to
be closed before calling CommitTransactionCommand(). There's probably a better
way to write this, but I can't see that there's anything complicated about
handling partitioned tables.I'm so sorry about getting back late.
Thank you very much for helping me consider this issue.
I compiled the patch v1 you provided. And I patch v2-001 again to enter postgresql.
I got a coredump that was easy to reproduce. As follows:
I have been trying to get familiar with the source code of create index.
Can you solve this bug first? I will try my best to implement CIC with you.
Next, I will read your patchs v2-002 and v2-003.
Thanks, fixed.
On Fri, Jun 12, 2020 at 04:20:17PM +0900, Michael Paquier wrote:
When it comes to test behaviors specific to partitioning, there are in
my experience three things to be careful about and stress in the tests:
- Use at least two layers of partitioning.
- Include into the partition tree a partition that has no leaf
partitions.
- Test the commands on the top-most parent, a member in the middle of
the partition tree, the partition with no leaves, and one leaf, making
sure that relfilenode changes where it should and that partition trees
remain intact (you can use pg_partition_tree() for that.)
Added, thanks for looking.
--
Justin
On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
As shown above, an error occurred while creating an index in the second partition.
It can be clearly seen that the index of the partitioned table is invalid
and the index of the first partition is normal, the second partition is invalid,
and the Third Partition index does not exist at all.
That's a problem. I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction. Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
--
Michael
That's a problem. I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction. Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
This is a good idea.
We should maintain the consistency of the entire partition table.
However, this is not a small change in the code.
May be we need to make a new design for DefineIndex function....
But most importantly, if we use three steps to implement CIC,
we will spend more time than ordinary tables, especially in high concurrency cases.
To wait for one of partitions which the users to use frequently,
the creation of other partition indexes is delayed.
Is it worth doing this?
Regards, Adger
------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月15日(星期一) 20:37
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:how to create index concurrently on partitioned table
On Mon, Jun 15, 2020 at 08:15:05PM +0800, 李杰(慎追) wrote:
As shown above, an error occurred while creating an index in the second partition.
It can be clearly seen that the index of the partitioned table is invalid
and the index of the first partition is normal, the second partition is invalid,
and the Third Partition index does not exist at all.
That's a problem. I really think that we should make the steps of the
concurrent operation consistent across all relations, meaning that all
the indexes should be created as invalid for all the parts of the
partition tree, including partitioned tables as well as their
partitions, in the same transaction. Then a second new transaction
gets used for the index build, followed by a third one for the
validation that switches the indexes to become valid.
--
Michael
On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
This is a good idea.
We should maintain the consistency of the entire partition table.
However, this is not a small change in the code.
May be we need to make a new design for DefineIndex function....
Indeed. I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:
+ /*
+ * CIC needs to mark a partitioned table as VALID, which itself
+ * requires setting READY, which is unset for CIC (even though
+ * it's meaningless for an index without storage).
+ */
+ if (concurrent)
+ {
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+ CommandCounterIncrement();
+ index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
}
I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees. That's the point you
raised. This one is going to need more thoughts.
But most importantly, if we use three steps to implement CIC,
we will spend more time than ordinary tables, especially in high
concurrency cases. To wait for one of partitions which the users to
use frequently, the creation of other partition indexes is delayed.
Is it worth doing this?
CIC is an operation that exists while allowing read and writes to
still happen in parallel, so that's fine IMO if it takes time. Now it
is true that it may not scale well so we should be careful with the
approach taken. Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
--
Michael
I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees. That's the point you
raised. This one is going to need more thoughts.
CIC is an operation that exists while allowing read and writes to
still happen in parallel, so that's fine IMO if it takes time. Now it
is true that it may not scale well so we should be careful with the
approach taken. Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
I'm with you on that.
(Scheme 1)
We can refer to the implementation in the ReindexRelationConcurrently,
in the three phases of the REINDEX CONCURRENTLY,
all indexes of the partitions are operated one by one in each phase.
In this way, we can maintain the consistency of the entire partitioned table index.
After we implement CIC in this way, we can also complete reindex partitioned table index concurrently (this is not done now.)
Looking back, let's talk about our original schema.
(Scheme 2)
If CIC is performed one by one on each partition,
how can we make it on subsequent partitions continue when an error occurs in the second partition?
If this problem can be solved, It's not that I can't accept it.
Because a partition CIC error is acceptable, you can reindex it later.
Pseudo indexes on partitioned tables are useless for real queries,
but the indexes on partitions are really useful.
Scheme 1, more elegant, can also implement reindex concurrently on partitioned table,
but the implementation is more complex.
Scheme 2: simple implementation, but not so friendly.
Hi Jsutin,
Which scheme do you think is more helpful to realize CIC?
------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月16日(星期二) 09:02
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:回复:how to create index concurrently on partitioned table
On Mon, Jun 15, 2020 at 09:33:05PM +0800, 李杰(慎追) wrote:
This is a good idea.
We should maintain the consistency of the entire partition table.
However, this is not a small change in the code.
May be we need to make a new design for DefineIndex function....
Indeed. I have looked at the patch set a bit and here is the related
bit for CIC in 0001, meaning that you handle the basic index creation
for a partition tree within one transaction for each:
+ /*
+ * CIC needs to mark a partitioned table as VALID, which itself
+ * requires setting READY, which is unset for CIC (even though
+ * it's meaningless for an index without storage).
+ */
+ if (concurrent)
+ {
+ PopActiveSnapshot();
+ CommitTransactionCommand();
+ StartTransactionCommand();
+ index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+
+ CommandCounterIncrement();
+ index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
}
I am afraid that this makes the error handling more complicated, with
risks of having inconsistent partition trees. That's the point you
raised. This one is going to need more thoughts.
But most importantly, if we use three steps to implement CIC,
we will spend more time than ordinary tables, especially in high
concurrency cases. To wait for one of partitions which the users to
use frequently, the creation of other partition indexes is delayed.
Is it worth doing this?
CIC is an operation that exists while allowing read and writes to
still happen in parallel, so that's fine IMO if it takes time. Now it
is true that it may not scale well so we should be careful with the
approach taken. Also, I think that the case of REINDEX would require
less work overall because we already have some code in place to gather
multiple indexes from one or more relations and work on these
consistently, all at once.
--
Michael
We can refer to the implementation in the ReindexRelationConcurrently,
in the three phases of the REINDEX CONCURRENTLY,
all indexes of the partitions are operated one by one in each phase.
In this way, we can maintain the consistency of the entire partitioned table index.
After we implement CIC in this way, we can also complete reindex partitioned table index concurrently (this is not done now.)
After careful analysis, I found that there were two choices that embarrassed me.
Although we can handle the entire partition tree with one transaction in each of the three phases of CIC, just like ordinary tables.
However, I found a problem. If there are many partitions,
we may need to handle too many missing index entries when validate_index() .
Especially for the first partition, the time may have been long and many entries are missing.
In this case, why don't we put the second and third phase together into a transaction for each partition?
So, which schema do you think is better?
Choose to maintain consistency in all three phases,
or just maintain consistency in the first phase?
Thank you very much,
Regards, Adger
Import Notes
Reply to msg id not found: 41257272-104e-44d3-8cda-c1340bd01b7a.
On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
However, I found a problem. If there are many partitions,
we may need to handle too many missing index entries when
validate_index(). Especially for the first partition, the time may
have been long and many entries are missing. In this case, why
don't we put the second and third phase together into a transaction
for each partition?
Not sure I am following. In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree. See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create(). It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael
Not sure I am following. In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree. See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create(). It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
Let's look at this example:
A partition table has five partitions,
parttable: part1, part2, part3, part3 ,part5
We simply abstract the following definitions:
phase 1: index_create(), it is only registered in catalogs.
phase 2: index_concurrently_build(), Build the indexes.
phase 3: validate_index(), insert any missing index entries, mark the index as valid.
(schema 1)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction
StartTransaction two
parttable phase 2part 1 phase 2
part 2 phase 2
part 3 phase 2 (error occurred )
part 4 phase 2
part 5 phase 2
CommitTransaction
StartTransaction three
parttable phase 3
part 1 phase 3
part 2 phase 3
part 3 phase 3
part 4 phase 4
part 5 phase 5 CommitTransaction
...
```
Now, the following steps cannot continue due to an error in Transaction two .
so, Transaction two roll back, Transaction three haven't started.
All of our indexes are invalid. In this way,
we ensure the strong consistency of indexes in the partition tree.
However, we need to rebuild all indexes when reindex.
(schema 2)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction
StartTransaction two part 1 phase 2
part 1 phase 3
CommitTransaction
StartTransaction three part 2 phase 2
part 2 phase 3
CommitTransaction
StartTransaction fourpart 3 phase 2 (error occurred )
part 3 phase 3
CommitTransaction
StartTransaction five part 4 phase 2
part 4 phase 3
StartTransaction sixpart 5 phase 2
part 5 phase 3
CommitTransaction
StartTransaction sevenparttable phase 2
parttable phase 3
CommitTransaction
```
Now, the following steps cannot continue due to an error in Transaction four .
so, Transaction four roll back, Transactions behind Transaction 3 have not started
The indexes of the p1 and p2 partitions are available. Other indexes are invalid.
In reindex, we can ignore the rebuild of p1 and p2.
This seems better, although it seems to be inconsistent.
Do you think that scheme is more suitable for CIC?
Thank you very much,
Regards, Adger
------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月18日(星期四) 10:41
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:回复:回复:how to create index concurrently on partitioned table
On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
However, I found a problem. If there are many partitions,
we may need to handle too many missing index entries when
validate_index(). Especially for the first partition, the time may
have been long and many entries are missing. In this case, why
don't we put the second and third phase together into a transaction
for each partition?
Not sure I am following. In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree. See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create(). It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael
Not sure I am following. In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree. See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create(). It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
Some errors in the last email were not clearly expressed.
Let's look at this example:
A partition table has five partitions,
parttable: part1, part2, part3, part3 ,part5
We simply abstract the following definitions:
phase 1: index_create(), it is only registered in catalogs.
phase 2: index_concurrently_build(), Build the indexes.
phase 3: validate_index(), insert any missing index entries, mark the index as valid.
(scheme 1)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction
StartTransaction two
parttable phase 2part 1 phase 2
part 2 phase 2
part 3 phase 2 (error occurred )
part 4 phase 2
part 5 phase 2
CommitTransaction
StartTransaction three
parttable phase 3
part 1 phase 3
part 2 phase 3
part 3 phase 3
part 4 phase 4
part 5 phase 5 CommitTransaction
...
```
Now, the following steps cannot continue due to an error in Transaction two .
so, Transaction two roll back, Transaction three haven't started.
All of our indexes are invalid. In this way,
we ensure the strong consistency of indexes in the partition tree.
However, we need to rebuild all indexes when reindex.
(scheme 2)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction
StartTransaction two part 1 phase 2
CommitTransaction
StartTransaction three
part 1 phase 3
CommitTransaction
StartTransaction fourpart 2 phase 2
CommitTransaction
StartTransaction five
part 2 phase 3
CommitTransaction
StartTransaction sixpart 3 phase 2 (error occurred )
CommitTransaction
StartTransaction seven
part 3 phase 3
CommitTransaction
StartTransaction xxpart 4 phase 2
CommitTransaction
StartTransaction xx
part 4 phase 3
CommitTransaction
StartTransaction xxpart 5 phase 2
CommitTransaction
StartTransaction xx
part 5 phase 3
CommitTransaction
StartTransaction xxparttable phase 2
CommitTransaction
StartTransaction xx
parttable phase 3
CommitTransaction
```
Now, the following steps cannot continue due to an error in Transaction six .
so, Transaction six roll back, Transactions behind Transaction six have not started
The indexes of the p1 and p2 partitions are available. Other indexes are invalid.
In reindex, we can ignore the rebuild of p1 and p2.
Especially when there are many partitions,
this can save the rebuild of some partition indexes,
This seems better, although it seems to be inconsistent.
Do you think that scheme is more suitable for CIC?
Thank you very much,
Regards, Adger
------------------------------------------------------------------
发件人:李杰(慎追) <adger.lj@alibaba-inc.com>
发送时间:2020年6月18日(星期四) 14:37
收件人:Michael Paquier <michael@paquier.xyz>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:回复:回复:回复:回复:how to create index concurrently on partitioned table
Not sure I am following. In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree. See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create(). It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
Let's look at this example:
A partition table has five partitions,
parttable: part1, part2, part3, part3 ,part5
We simply abstract the following definitions:
phase 1: index_create(), it is only registered in catalogs.
phase 2: index_concurrently_build(), Build the indexes.
phase 3: validate_index(), insert any missing index entries, mark the index as valid.
(schema 1)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction
StartTransaction two
parttable phase 2part 1 phase 2
part 2 phase 2
part 3 phase 2 (error occurred )
part 4 phase 2
part 5 phase 2
CommitTransaction
StartTransaction three
parttable phase 3
part 1 phase 3
part 2 phase 3
part 3 phase 3
part 4 phase 4
part 5 phase 5 CommitTransaction
...
```
Now, the following steps cannot continue due to an error in Transaction two .
so, Transaction two roll back, Transaction three haven't started.
All of our indexes are invalid. In this way,
we ensure the strong consistency of indexes in the partition tree.
However, we need to rebuild all indexes when reindex.
(schema 2)
```
StartTransaction one
parttable phase 1
part 1 phase 1
part 2 phase 1
part 3 phase 1
part 4 phase 1
part 5 phase 1
CommitTransaction
StartTransaction two part 1 phase 2
part 1 phase 3
CommitTransaction
StartTransaction three part 2 phase 2
part 2 phase 3
CommitTransaction
StartTransaction fourpart 3 phase 2 (error occurred )
part 3 phase 3
CommitTransaction
StartTransaction five part 4 phase 2
part 4 phase 3
StartTransaction sixpart 5 phase 2
part 5 phase 3
CommitTransaction
StartTransaction sevenparttable phase 2
parttable phase 3
CommitTransaction
```
Now, the following steps cannot continue due to an error in Transaction four .
so, Transaction four roll back, Transactions behind Transaction 3 have not started
The indexes of the p1 and p2 partitions are available. Other indexes are invalid.
In reindex, we can ignore the rebuild of p1 and p2.
This seems better, although it seems to be inconsistent.
Do you think that scheme is more suitable for CIC?
Thank you very much,
Regards, Adger
------------------------------------------------------------------
发件人:Michael Paquier <michael@paquier.xyz>
发送时间:2020年6月18日(星期四) 10:41
收件人:李杰(慎追) <adger.lj@alibaba-inc.com>
抄 送:Justin Pryzby <pryzby@telsasoft.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>; 曾文旌(义从) <wenjing.zwj@alibaba-inc.com>; Alvaro Herrera <alvherre@2ndquadrant.com>
主 题:Re: 回复:回复:回复:how to create index concurrently on partitioned table
On Wed, Jun 17, 2020 at 10:22:28PM +0800, 李杰(慎追) wrote:
However, I found a problem. If there are many partitions,
we may need to handle too many missing index entries when
validate_index(). Especially for the first partition, the time may
have been long and many entries are missing. In this case, why
don't we put the second and third phase together into a transaction
for each partition?
Not sure I am following. In the case of REINDEX, it seems to me that
the calls to validate_index() and index_concurrently_build() can
happen in a separate transaction for each index, as long as all the
calls to index_concurrently_swap() are grouped together in the same
transaction to make sure that index partition trees are switched
consistently when all entries are swapped from an invalid state to a
valid state, because the swapping phase is also when we attach a fresh
index to a partition tree. See also index_concurrently_create_copy()
where we don't set parentIndexRelid for the lower call to
index_create(). It would be good of course to check that when
swapping we have the code to handle that for a lot of indexes at
once.
--
Michael