Patch: Global Unique Index
A "global unique index", in our definition, is a unique index on a partitioned table that ensures cross-partition uniqueness using a non-partition key. This work is inspired by the email thread "Proposal: Global Index" started back in 2019 (link below). My colleague David and I took a different approach to implement a uniqueness constraint spanning multiple partitions. We achieved this mainly through application logic, without heavy modification to Postgres's current partitioned table/index structure. In other words, a global unique index and a regular partitioned index are essentially the same in terms of storage structure, except that one can perform a cross-partition uniqueness check and the other cannot.
/messages/by-id/CALtqXTcurqy1PKXzP9XO=ofLLA5wBSo77BnUnYVEZpmcA3V0ag@mail.gmail.com
- Patch -
The attached patches were generated against commit `85d8b30724c0fd117a683cc72706f71b28463a05` on the master branch.
- Benefits of global unique index -
1. Ensures uniqueness spanning all partitions using a non-partition-key column
2. Allows a user to create a unique index on a non-partition-key column without having to include the partition key (which current Postgres enforces)
3. Improves query performance when looking up a single unique non-partition-key column
- Supported Features -
1. Global unique index is supported only on the btree index type
2. Global unique index is useful only when created on a partitioned table
3. Cross-partition uniqueness check with CREATE UNIQUE INDEX in serial and parallel mode
4. Cross-partition uniqueness check with ATTACH in serial and parallel mode
5. Cross-partition uniqueness check on INSERT and UPDATE
- Not-supported Features -
1. Global uniqueness check on sub-partitioned tables is not yet supported, as we do not have an immediate use case and it may involve major changes to the current implementation
- Global unique index syntax -
A global unique index can be created with the "GLOBAL" and "UNIQUE" clauses in a "CREATE INDEX" statement run against a partitioned table. For example:
CREATE UNIQUE INDEX global_index ON idxpart(bid) GLOBAL;
- New Relkind: RELKIND_GLOBAL_INDEX -
When a global unique index is created on a partitioned table, its relkind is RELKIND_PARTITIONED_INDEX (I), the same as for a regular partitioned index. Postgres then recursively creates an index on each child partition, except that the relkind is now set to RELKIND_GLOBAL_INDEX (g) instead of RELKIND_INDEX (i). This new relkind, along with the uniqueness flag, is needed for the cross-partition uniqueness check later.
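With the patch applied, this relkind difference should be visible directly in pg_class. A hypothetical psql session (table and index names are illustrative; the child index name assumes the usual auto-naming convention):

```sql
-- Requires a server built with this patch set.
CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (0) TO (10);
CREATE UNIQUE INDEX t_b_idx ON t (b) GLOBAL;

SELECT relname, relkind
FROM pg_class
WHERE relname IN ('t_b_idx', 't1_b_idx');
--  relname  | relkind
-- ----------+---------
--  t_b_idx  | I        -- parent: RELKIND_PARTITIONED_INDEX
--  t1_b_idx | g        -- child:  RELKIND_GLOBAL_INDEX
```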
- Create a global unique index -
To create a regular unique index on a partitioned table, Postgres performs a heap scan and sort on every child partition. The uniqueness check happens during the sorting phase and raises an error if multiple tuples with the same index key sort together. To achieve a global uniqueness check, we make Postgres perform the sort after all of the child partitions have been scanned, instead of in the "sort per partition" fashion. In other words, the sort happens only once, at the very end, over the tuples coming from all partitions, and can therefore ensure global uniqueness.
In the parallel index build case the idea is the same, except that the tuples are written to a shared fileset (temp files) on disk rather than kept in memory, so that the sort results can be shared among workers. At the end of the last partition build, Postgres takes over all the temp files and performs a final merge sort to ensure global uniqueness.
Example:
CREATE TABLE gidx_part(a int, b int, c text) PARTITION BY RANGE (a);
CREATE TABLE gidx_part1 PARTITION OF gidx_part FOR VALUES FROM (0) to (10);
CREATE TABLE gidx_part2 PARTITION OF gidx_part FOR VALUES FROM (10) to (20);
INSERT INTO gidx_part values(5, 5, 'test');
INSERT INTO gidx_part values(15, 5, 'test');
CREATE UNIQUE INDEX global_unique_idx ON gidx_part(b) GLOBAL;
ERROR: could not create unique index "gidx_part1_b_idx"
DETAIL: Key (b)=(5) is duplicated.
- INSERT and UPDATE -
For every new tuple inserted or updated, Postgres attempts to fetch a matching tuple from the current partition to determine whether a duplicate already exists. In the global unique index case, we make Postgres attempt the same fetch against all other partitions as well as the current one. If a duplicate is found, global uniqueness is violated and an error is raised.
Example:
CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);
CREATE TABLE gidx_part1 PARTITION OF gidx_part FOR VALUES FROM (0) TO (10);
CREATE TABLE gidx_part2 PARTITION OF gidx_part FOR VALUES FROM (10) TO (20);
CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL;
INSERT INTO gidx_part values(5, 5, 'test');
INSERT INTO gidx_part values(15, 5, 'test');
ERROR: duplicate key value violates unique constraint "gidx_part1_b_idx"
DETAIL: Key (b)=(5) already exists.
- ATTACH -
The partition-to-be may already contain a matching regular unique index, or no index at all. If the partitioned table has a global unique index and the partition-to-be has no index, a new global unique index is automatically created on it upon ATTACH, and a global uniqueness check is run between all current partitions and the partition-to-be.
If the partition-to-be already contains a matching regular unique index, Postgres changes that index's relkind from RELKIND_INDEX to RELKIND_GLOBAL_INDEX and runs the same global uniqueness check; no new index is created in this case.
If a duplicate record is found, global uniqueness is violated and an error is raised.
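Continuing the earlier example, attaching a partition that would introduce a duplicate should fail the cross-partition check (a sketch; the exact error text may differ):

```sql
CREATE TABLE gidx_part3 (a int, b int, c text);
INSERT INTO gidx_part3 VALUES (25, 5, 'test');  -- b = 5 already exists in gidx_part1
ALTER TABLE gidx_part ATTACH PARTITION gidx_part3 FOR VALUES FROM (20) TO (30);
-- expected to raise a uniqueness violation during the ATTACH-time check
```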
- DETACH -
Since we retain the same partitioned structure, detaching a partition with a global unique index is straightforward. Upon DETACH, Postgres changes the index's relkind from RELKIND_GLOBAL_INDEX to RELKIND_INDEX and removes the inheritance relationship as usual.
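For example (a sketch, using the tables from the earlier example):

```sql
ALTER TABLE gidx_part DETACH PARTITION gidx_part2;
-- gidx_part2's index relkind changes from 'g' (RELKIND_GLOBAL_INDEX)
-- back to 'i' (RELKIND_INDEX); it remains a standalone unique index,
-- but cross-partition uniqueness is no longer enforced against it.
```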
- Optimizer, query planning and vacuum -
Since no major modification is made to a global unique index's structure and storage, it behaves the same way as a regular partitioned index. No major changes are required in the optimizer, planner, or vacuum, as they work the same way as for a regular index.
- REINDEX -
A global unique index can be reindexed normally, just like a regular index. No cross-partition uniqueness check is performed while a global unique index is being rebuilt. This is okay as long as an exclusive lock is acquired on the index relation.
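For example, rebuilding either one leaf index or the whole partitioned index tree should behave as usual (index names taken from the earlier example):

```sql
REINDEX INDEX gidx_part1_b_idx;   -- rebuild one leaf global index
REINDEX INDEX global_unique_idx;  -- rebuild the whole partitioned index tree
```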
- Benchmark Result -
Using pgbench with 200 partitions, running SELECT-only and read-write tests against a unique non-partition key, we observed orders of magnitude higher TPS compared to a regular unique index built with the partition-key restriction (a multicolumn index).
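The exact benchmark scripts are not included here; a rough sketch of such a setup using standard pgbench options looks like this (the custom query script exercising the unique non-partition-key column is assumed, not shown):

```shell
# Hypothetical sketch: initialize pgbench with 200 range partitions,
# then run SELECT-only and read-write tests. A custom script (-f)
# querying the unique non-partition-key column would be needed to
# reproduce our setup.
pgbench -i --partitions=200 --partition-method=range testdb
pgbench -S -T 60 -c 8 testdb   # SELECT-only
pgbench -T 60 -c 8 testdb      # read-write (TPC-B-like)
```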
- TODOs -
Since this is a POC patch, there are several TODOs related to user experience, such as:
1. Raise an error when a user uses CREATE UNIQUE INDEX with the ON ONLY clause
2. Raise an error when a user tries to create a global unique index directly on a child partition
3. ... maybe more
We will work on these at a later time.
Thank you.
Please let us know your thoughts or questions about the feature.
All comments are welcome and greatly appreciated!
David and Cary
============================
HighGo Software Canada
Attachments:
0001-support-global-unique-index-with-non-partition-key.patch
From 218dfb0053b0d47f59de8a63ed1a3abdbde14c19 Mon Sep 17 00:00:00 2001
From: David Zhang <david.zhang@highgo.ca>
Date: Thu, 17 Nov 2022 12:20:34 -0800
Subject: [PATCH 1/4] support global unique index with non-partition key
---
contrib/pageinspect/btreefuncs.c | 7 +++--
doc/src/sgml/ref/create_index.sgml | 13 +++++++++
src/backend/access/common/reloptions.c | 1 +
src/backend/access/index/indexam.c | 1 +
src/backend/access/table/table.c | 1 +
src/backend/catalog/aclchk.c | 3 ++
src/backend/catalog/dependency.c | 1 +
src/backend/catalog/heap.c | 4 ++-
src/backend/catalog/index.c | 11 +++++--
src/backend/catalog/objectaddress.c | 4 +++
src/backend/catalog/pg_class.c | 2 ++
src/backend/commands/cluster.c | 6 ++--
src/backend/commands/indexcmds.c | 28 +++++++++++++++---
src/backend/commands/tablecmds.c | 36 ++++++++++++++++++++---
src/backend/optimizer/util/plancat.c | 3 +-
src/backend/parser/gram.y | 13 +++++++--
src/backend/parser/parse_utilcmd.c | 1 +
src/backend/utils/adt/amutils.c | 1 +
src/backend/utils/cache/relcache.c | 17 ++++++++---
src/bin/psql/describe.c | 17 ++++++++++-
src/include/catalog/index.h | 1 +
src/include/catalog/pg_class.h | 2 ++
src/include/nodes/parsenodes.h | 1 +
src/test/regress/expected/indexing.out | 40 ++++++++++++++++++++++++++
src/test/regress/sql/indexing.sql | 11 +++++++
25 files changed, 199 insertions(+), 26 deletions(-)
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 9375d55e14..fea5a3033b 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -51,6 +51,7 @@ PG_FUNCTION_INFO_V1(bt_page_stats);
#define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
#define DatumGetItemPointer(X) ((ItemPointer) DatumGetPointer(X))
#define ItemPointerGetDatum(X) PointerGetDatum(X)
+#define IS_GLOBAL_INDEX(r) ((r)->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
/* note: BlockNumber is unsigned, hence can't be negative */
#define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
@@ -205,7 +206,7 @@ bt_page_stats_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
rel = relation_openrv(relrv, AccessShareLock);
- if (!IS_INDEX(rel) || !IS_BTREE(rel))
+ if ((!IS_INDEX(rel) && !IS_GLOBAL_INDEX(rel)) || !IS_BTREE(rel))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is not a %s index",
@@ -473,7 +474,7 @@ bt_page_items_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
rel = relation_openrv(relrv, AccessShareLock);
- if (!IS_INDEX(rel) || !IS_BTREE(rel))
+ if ((!IS_INDEX(rel) && !IS_GLOBAL_INDEX(rel)) || !IS_BTREE(rel))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is not a %s index",
@@ -709,7 +710,7 @@ bt_metap(PG_FUNCTION_ARGS)
relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
rel = relation_openrv(relrv, AccessShareLock);
- if (!IS_INDEX(rel) || !IS_BTREE(rel))
+ if ((!IS_INDEX(rel) && !IS_GLOBAL_INDEX(rel)) || !IS_BTREE(rel))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("\"%s\" is not a %s index",
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 40986aa502..5444c096e1 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -28,6 +28,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
[ WITH ( <replaceable class="parameter">storage_parameter</replaceable> [= <replaceable class="parameter">value</replaceable>] [, ... ] ) ]
[ TABLESPACE <replaceable class="parameter">tablespace_name</replaceable> ]
[ WHERE <replaceable class="parameter">predicate</replaceable> ]
+ [ GLOBAL ]
</synopsis>
</refsynopsisdiv>
@@ -380,6 +381,18 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>GLOBAL</literal></term>
+ <listitem>
+ <para>
+ Used with <literal>UNIQUE</literal> to enable cross-partition
+ uniqueness check on a partitioned table. Attempts to insert or
+ update data which would result in duplicate entries in other
+ partitions as a whole will generate an error.
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
<refsect2 id="sql-createindex-storage-parameters" xreflabel="Index Storage Parameters">
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 75b7344891..70ad7ffe8f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1410,6 +1410,7 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
options = view_reloptions(datum, false);
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
case RELKIND_PARTITIONED_INDEX:
options = index_reloptions(amoptions, datum, false);
break;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index fe80b8b0ba..5bf36cc686 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -136,6 +136,7 @@ index_open(Oid relationId, LOCKMODE lockmode)
r = relation_open(relationId, lockmode);
if (r->rd_rel->relkind != RELKIND_INDEX &&
+ r->rd_rel->relkind != RELKIND_GLOBAL_INDEX &&
r->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
diff --git a/src/backend/access/table/table.c b/src/backend/access/table/table.c
index 7e94232f01..0f9570c894 100644
--- a/src/backend/access/table/table.c
+++ b/src/backend/access/table/table.c
@@ -138,6 +138,7 @@ static inline void
validate_relation_kind(Relation r)
{
if (r->rd_rel->relkind == RELKIND_INDEX ||
+ r->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
r->rd_rel->relkind == RELKIND_PARTITIONED_INDEX ||
r->rd_rel->relkind == RELKIND_COMPOSITE_TYPE)
ereport(ERROR,
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
index 42360d37ca..79e14ca356 100644
--- a/src/backend/catalog/aclchk.c
+++ b/src/backend/catalog/aclchk.c
@@ -1819,6 +1819,7 @@ ExecGrant_Relation(InternalGrant *istmt)
/* Not sensible to grant on an index */
if (pg_class_tuple->relkind == RELKIND_INDEX ||
+ pg_class_tuple->relkind == RELKIND_GLOBAL_INDEX ||
pg_class_tuple->relkind == RELKIND_PARTITIONED_INDEX)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -5950,6 +5951,7 @@ recordExtObjInitPriv(Oid objoid, Oid classoid)
* restrictions in ALTER EXTENSION ADD, but let's check anyway.)
*/
if (pg_class_tuple->relkind == RELKIND_INDEX ||
+ pg_class_tuple->relkind == RELKIND_GLOBAL_INDEX ||
pg_class_tuple->relkind == RELKIND_PARTITIONED_INDEX ||
pg_class_tuple->relkind == RELKIND_COMPOSITE_TYPE)
{
@@ -6244,6 +6246,7 @@ removeExtObjInitPriv(Oid objoid, Oid classoid)
* restrictions in ALTER EXTENSION DROP, but let's check anyway.)
*/
if (pg_class_tuple->relkind == RELKIND_INDEX ||
+ pg_class_tuple->relkind == RELKIND_GLOBAL_INDEX ||
pg_class_tuple->relkind == RELKIND_PARTITIONED_INDEX ||
pg_class_tuple->relkind == RELKIND_COMPOSITE_TYPE)
{
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7f3e64b5ae..5ec1f32efa 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1399,6 +1399,7 @@ doDeletion(const ObjectAddress *object, int flags)
char relKind = get_rel_relkind(object->objectId);
if (relKind == RELKIND_INDEX ||
+ relKind == RELKIND_GLOBAL_INDEX ||
relKind == RELKIND_PARTITIONED_INDEX)
{
bool concurrent = ((flags & PERFORM_DELETION_CONCURRENTLY) != 0);
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 5b49cc5a09..54a81c11dc 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -316,7 +316,8 @@ heap_create(const char *relname,
* user defined relation, not a system one.
*/
if (!allow_system_table_mods &&
- ((IsCatalogNamespace(relnamespace) && relkind != RELKIND_INDEX) ||
+ ((IsCatalogNamespace(relnamespace) &&
+ (relkind != RELKIND_INDEX && relkind != RELKIND_GLOBAL_INDEX)) ||
IsToastNamespace(relnamespace)) &&
IsNormalProcessingMode())
ereport(ERROR,
@@ -1300,6 +1301,7 @@ heap_create_with_catalog(const char *relname,
if (!(relkind == RELKIND_SEQUENCE ||
relkind == RELKIND_TOASTVALUE ||
relkind == RELKIND_INDEX ||
+ relkind == RELKIND_GLOBAL_INDEX ||
relkind == RELKIND_PARTITIONED_INDEX))
{
Oid new_array_oid;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 61f1d3926a..46a2fb6cc7 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -731,6 +731,7 @@ index_create(Relation heapRelation,
bool invalid = (flags & INDEX_CREATE_INVALID) != 0;
bool concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
bool partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+ bool globalindex = (flags & INDEX_CREATE_GLOBAL) != 0;
char relkind;
TransactionId relfrozenxid;
MultiXactId relminmxid;
@@ -742,7 +743,10 @@ index_create(Relation heapRelation,
/* partitioned indexes must never be "built" by themselves */
Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
- relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
+ if (globalindex)
+ relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_GLOBAL_INDEX;
+ else
+ relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
pg_class = table_open(RelationRelationId, RowExclusiveLock);
@@ -918,7 +922,7 @@ index_create(Relation heapRelation,
binary_upgrade_next_index_pg_class_oid = InvalidOid;
/* Override the index relfilenumber */
- if ((relkind == RELKIND_INDEX) &&
+ if ((relkind == RELKIND_INDEX || relkind == RELKIND_GLOBAL_INDEX) &&
(!RelFileNumberIsValid(binary_upgrade_next_index_pg_class_relfilenumber)))
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
@@ -2883,7 +2887,8 @@ index_update_stats(Relation rel,
BlockNumber relpages = RelationGetNumberOfBlocks(rel);
BlockNumber relallvisible;
- if (rd_rel->relkind != RELKIND_INDEX)
+ if (rd_rel->relkind != RELKIND_INDEX &&
+ rd_rel->relkind != RELKIND_GLOBAL_INDEX)
visibilitymap_count(rel, &relallvisible, NULL);
else /* don't bother for indexes */
relallvisible = 0;
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index c7de7232b8..0bfab32b65 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -1389,6 +1389,7 @@ get_relation_by_qualified_name(ObjectType objtype, List *object,
{
case OBJECT_INDEX:
if (relation->rd_rel->relkind != RELKIND_INDEX &&
+ relation->rd_rel->relkind != RELKIND_GLOBAL_INDEX &&
relation->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -4172,6 +4173,7 @@ getRelationDescription(StringInfo buffer, Oid relid, bool missing_ok)
relname);
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
case RELKIND_PARTITIONED_INDEX:
appendStringInfo(buffer, _("index %s"),
relname);
@@ -4706,6 +4708,7 @@ getRelationTypeDescription(StringInfo buffer, Oid relid, int32 objectSubId,
appendStringInfoString(buffer, "table");
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
case RELKIND_PARTITIONED_INDEX:
appendStringInfoString(buffer, "index");
break;
@@ -6187,6 +6190,7 @@ get_relkind_objtype(char relkind)
case RELKIND_PARTITIONED_TABLE:
return OBJECT_TABLE;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
case RELKIND_PARTITIONED_INDEX:
return OBJECT_INDEX;
case RELKIND_SEQUENCE:
diff --git a/src/backend/catalog/pg_class.c b/src/backend/catalog/pg_class.c
index b696fa2afd..39f410906c 100644
--- a/src/backend/catalog/pg_class.c
+++ b/src/backend/catalog/pg_class.c
@@ -45,6 +45,8 @@ errdetail_relkind_not_supported(char relkind)
return errdetail("This operation is not supported for partitioned tables.");
case RELKIND_PARTITIONED_INDEX:
return errdetail("This operation is not supported for partitioned indexes.");
+ case RELKIND_GLOBAL_INDEX:
+ return errdetail("This operation is not supported for global indexes.");
default:
elog(ERROR, "unrecognized relkind: '%c'", relkind);
return 0;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 1976a373ef..eff371ff12 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1199,7 +1199,8 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
*/
/* set rel1's frozen Xid and minimum MultiXid */
- if (relform1->relkind != RELKIND_INDEX)
+ if (relform1->relkind != RELKIND_INDEX &&
+ relform1->relkind != RELKIND_GLOBAL_INDEX)
{
Assert(!TransactionIdIsValid(frozenXid) ||
TransactionIdIsNormal(frozenXid));
@@ -1686,7 +1687,8 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
RelToCluster *rtc;
/* consider only leaf indexes */
- if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
+ if (get_rel_relkind(indexrelid) != RELKIND_INDEX &&
+ get_rel_relkind(indexrelid) != RELKIND_GLOBAL_INDEX)
continue;
/* Silently skip partitions which the user has no access to. */
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 659e189549..099bf68e77 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -711,8 +711,20 @@ DefineIndex(Oid relationId,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot create exclusion constraints on partitioned table \"%s\"",
RelationGetRelationName(rel))));
+
+ if (stmt->global_index && !stmt->unique)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create global index without unique on partitioned table \"%s\"",
+ RelationGetRelationName(rel))));
}
+ if (stmt->global_index && rel->rd_rel->relkind == RELKIND_RELATION && !rel->rd_rel->relispartition)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create global index on non-partitioned table \"%s\"",
+ RelationGetRelationName(rel))));
+
/*
* Don't try to CREATE INDEX on temp tables of other backends.
*/
@@ -923,7 +935,7 @@ DefineIndex(Oid relationId,
* We could lift this limitation if we had global indexes, but those have
* their own problems, so this is a useful feature combination.
*/
- if (partitioned && (stmt->unique || stmt->primary))
+ if (partitioned && (stmt->unique || stmt->primary) && !stmt->global_index)
{
PartitionKey key = RelationGetPartitionKey(rel);
const char *constraint_type;
@@ -1026,7 +1038,7 @@ DefineIndex(Oid relationId,
}
}
- if (!found)
+ if (!found && !stmt->global_index)
{
Form_pg_attribute att;
@@ -1145,6 +1157,8 @@ DefineIndex(Oid relationId,
if (pd->nparts != 0)
flags |= INDEX_CREATE_INVALID;
}
+ if (stmt->global_index)
+ flags |= INDEX_CREATE_GLOBAL;
if (stmt->deferrable)
constr_flags |= INDEX_CONSTR_CREATE_DEFERRABLE;
@@ -2784,6 +2798,7 @@ RangeVarCallbackForReindexIndex(const RangeVar *relation,
if (!relkind)
return;
if (relkind != RELKIND_INDEX &&
+ relkind != RELKIND_GLOBAL_INDEX &&
relkind != RELKIND_PARTITIONED_INDEX)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
@@ -3176,6 +3191,7 @@ ReindexPartitions(Oid relid, ReindexParams *params, bool isTopLevel)
continue;
Assert(partkind == RELKIND_INDEX ||
+ partkind == RELKIND_GLOBAL_INDEX ||
partkind == RELKIND_RELATION);
/* Save partition OID */
@@ -3268,7 +3284,8 @@ ReindexMultipleInternal(List *relids, ReindexParams *params)
(void) ReindexRelationConcurrently(relid, &newparams);
/* ReindexRelationConcurrently() does the verbose output */
}
- else if (relkind == RELKIND_INDEX)
+ else if (relkind == RELKIND_INDEX ||
+ relkind == RELKIND_GLOBAL_INDEX)
{
ReindexParams newparams = *params;
@@ -3527,6 +3544,7 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
break;
}
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
{
Oid heapId = IndexGetRelation(relationOid,
(params->options & REINDEXOPT_MISSING_OK) != 0);
@@ -4127,7 +4145,8 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
/* Log what we did */
if ((params->options & REINDEXOPT_VERBOSE) != 0)
{
- if (relkind == RELKIND_INDEX)
+ if (relkind == RELKIND_INDEX ||
+ relkind == RELKIND_GLOBAL_INDEX)
ereport(INFO,
(errmsg("index \"%s.%s\" was reindexed",
relationNamespace, relationName),
@@ -4180,6 +4199,7 @@ IndexSetParentIndex(Relation partitionIdx, Oid parentOid)
/* Make sure this is an index */
Assert(partitionIdx->rd_rel->relkind == RELKIND_INDEX ||
+ partitionIdx->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
partitionIdx->rd_rel->relkind == RELKIND_PARTITIONED_INDEX);
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6007e10730..b16a1180d8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -294,6 +294,12 @@ static const struct dropmsgstrings dropmsgstringarray[] = {
gettext_noop("index \"%s\" does not exist, skipping"),
gettext_noop("\"%s\" is not an index"),
gettext_noop("Use DROP INDEX to remove an index.")},
+ {RELKIND_GLOBAL_INDEX,
+ ERRCODE_UNDEFINED_OBJECT,
+ gettext_noop("index \"%s\" does not exist"),
+ gettext_noop("index \"%s\" does not exist, skipping"),
+ gettext_noop("\"%s\" is not an index"),
+ gettext_noop("Use DROP INDEX to remove an index.")},
{'\0', 0, NULL, NULL, NULL, NULL}
};
@@ -320,6 +326,7 @@ struct DropRelationCallbackState
#define ATT_FOREIGN_TABLE 0x0020
#define ATT_PARTITIONED_INDEX 0x0040
#define ATT_SEQUENCE 0x0080
+#define ATT_GLOBAL_INDEX 0x0100
/*
* ForeignTruncateInfo
@@ -1564,6 +1571,8 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
expected_relkind = RELKIND_RELATION;
else if (classform->relkind == RELKIND_PARTITIONED_INDEX)
expected_relkind = RELKIND_INDEX;
+ else if (classform->relkind == RELKIND_GLOBAL_INDEX)
+ expected_relkind = RELKIND_GLOBAL_INDEX;
else
expected_relkind = classform->relkind;
@@ -1584,7 +1593,8 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
* only concerns indexes of toast relations that became invalid during a
* REINDEX CONCURRENTLY process.
*/
- if (IsSystemClass(relOid, classform) && classform->relkind == RELKIND_INDEX)
+ if (IsSystemClass(relOid, classform) && (classform->relkind == RELKIND_INDEX ||
+ classform->relkind == RELKIND_GLOBAL_INDEX))
{
HeapTuple locTuple;
Form_pg_index indexform;
@@ -1623,7 +1633,8 @@ RangeVarCallbackForDropRelation(const RangeVar *rel, Oid relOid, Oid oldRelOid,
* entry, though --- the relation may have been dropped. Note that this
* code will execute for either plain or partitioned indexes.
*/
- if (expected_relkind == RELKIND_INDEX &&
+ if ((expected_relkind == RELKIND_INDEX ||
+ expected_relkind == RELKIND_GLOBAL_INDEX) &&
relOid != oldRelOid)
{
state->heapOid = IndexGetRelation(relOid, true);
@@ -3406,6 +3417,7 @@ renameatt_check(Oid myrelid, Form_pg_class classform, bool recursing)
relkind != RELKIND_MATVIEW &&
relkind != RELKIND_COMPOSITE_TYPE &&
relkind != RELKIND_INDEX &&
+ relkind != RELKIND_GLOBAL_INDEX &&
relkind != RELKIND_PARTITIONED_INDEX &&
relkind != RELKIND_FOREIGN_TABLE &&
relkind != RELKIND_PARTITIONED_TABLE)
@@ -3837,6 +3849,7 @@ RenameRelation(RenameStmt *stmt)
*/
relkind = get_rel_relkind(relid);
obj_is_index = (relkind == RELKIND_INDEX ||
+ relkind == RELKIND_GLOBAL_INDEX ||
relkind == RELKIND_PARTITIONED_INDEX);
if (obj_is_index || is_index_stmt == obj_is_index)
break;
@@ -3902,6 +3915,7 @@ RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal, bo
*/
Assert(!is_index ||
is_index == (targetrelation->rd_rel->relkind == RELKIND_INDEX ||
+ targetrelation->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
targetrelation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX));
/*
@@ -3929,6 +3943,7 @@ RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal, bo
* Also rename the associated constraint, if any.
*/
if (targetrelation->rd_rel->relkind == RELKIND_INDEX ||
+ targetrelation->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
targetrelation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
{
Oid constraintId = get_index_constraint(myrelid);
@@ -4013,6 +4028,7 @@ CheckTableNotInUse(Relation rel, const char *stmt)
stmt, RelationGetRelationName(rel))));
if (rel->rd_rel->relkind != RELKIND_INDEX &&
+ rel->rd_rel->relkind != RELKIND_GLOBAL_INDEX &&
rel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX &&
AfterTriggerPendingOnRel(RelationGetRelid(rel)))
ereport(ERROR,
@@ -6271,6 +6287,9 @@ ATSimplePermissions(AlterTableType cmdtype, Relation rel, int allowed_targets)
case RELKIND_INDEX:
actual_target = ATT_INDEX;
break;
+ case RELKIND_GLOBAL_INDEX:
+ actual_target = ATT_GLOBAL_INDEX;
+ break;
case RELKIND_PARTITIONED_INDEX:
actual_target = ATT_PARTITIONED_INDEX;
break;
@@ -8061,6 +8080,7 @@ ATExecSetStatistics(Relation rel, const char *colName, int16 colNum, Node *newVa
* column numbers could contain gaps if columns are later dropped.
*/
if (rel->rd_rel->relkind != RELKIND_INDEX &&
+ rel->rd_rel->relkind != RELKIND_GLOBAL_INDEX &&
rel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX &&
!colName)
ereport(ERROR,
@@ -8122,6 +8142,7 @@ ATExecSetStatistics(Relation rel, const char *colName, int16 colNum, Node *newVa
colName)));
if (rel->rd_rel->relkind == RELKIND_INDEX ||
+ rel->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
rel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
{
if (attnum > rel->rd_index->indnkeyatts)
@@ -13736,6 +13757,7 @@ ATExecChangeOwner(Oid relationOid, Oid newOwnerId, bool recursing, LOCKMODE lock
/* ok to change owner */
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
if (!recursing)
{
/*
@@ -13886,6 +13908,7 @@ ATExecChangeOwner(Oid relationOid, Oid newOwnerId, bool recursing, LOCKMODE lock
*/
if (tuple_class->relkind != RELKIND_COMPOSITE_TYPE &&
tuple_class->relkind != RELKIND_INDEX &&
+ tuple_class->relkind != RELKIND_GLOBAL_INDEX &&
tuple_class->relkind != RELKIND_PARTITIONED_INDEX &&
tuple_class->relkind != RELKIND_TOASTVALUE)
changeDependencyOnOwner(RelationRelationId, relationOid,
@@ -14232,6 +14255,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
(void) view_reloptions(newOptions, true);
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
case RELKIND_PARTITIONED_INDEX:
(void) index_reloptions(rel->rd_indam->amoptions, newOptions, true);
break;
@@ -14419,7 +14443,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
newrlocator.spcOid = newTableSpace;
/* hand off to AM to actually create new rel storage and copy the data */
- if (rel->rd_rel->relkind == RELKIND_INDEX)
+ if (rel->rd_rel->relkind == RELKIND_INDEX ||
+ rel->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
{
index_copy_data(rel, newrlocator);
}
@@ -14602,6 +14627,7 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
relForm->relkind != RELKIND_PARTITIONED_TABLE) ||
(stmt->objtype == OBJECT_INDEX &&
relForm->relkind != RELKIND_INDEX &&
+ relForm->relkind != RELKIND_GLOBAL_INDEX &&
relForm->relkind != RELKIND_PARTITIONED_INDEX) ||
(stmt->objtype == OBJECT_MATVIEW &&
relForm->relkind != RELKIND_MATVIEW))
@@ -17103,6 +17129,7 @@ RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid, Oid oldrelid,
errmsg("\"%s\" is not a composite type", rv->relname)));
if (reltype == OBJECT_INDEX && relkind != RELKIND_INDEX &&
+ relkind != RELKIND_GLOBAL_INDEX &&
relkind != RELKIND_PARTITIONED_INDEX
&& !IsA(stmt, RenameStmt))
ereport(ERROR,
@@ -17125,7 +17152,7 @@ RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid, Oid oldrelid,
*/
if (IsA(stmt, AlterObjectSchemaStmt))
{
- if (relkind == RELKIND_INDEX || relkind == RELKIND_PARTITIONED_INDEX)
+ if (relkind == RELKIND_INDEX || relkind == RELKIND_GLOBAL_INDEX || relkind == RELKIND_PARTITIONED_INDEX)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
errmsg("cannot change schema of index \"%s\"",
@@ -18870,6 +18897,7 @@ RangeVarCallbackForAttachIndex(const RangeVar *rv, Oid relOid, Oid oldRelOid,
return; /* concurrently dropped, so nothing to do */
classform = (Form_pg_class) GETSTRUCT(tuple);
if (classform->relkind != RELKIND_PARTITIONED_INDEX &&
+ classform->relkind != RELKIND_GLOBAL_INDEX &&
classform->relkind != RELKIND_INDEX)
ereport(ERROR,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 9defe37836..5417cafca1 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -990,7 +990,8 @@ estimate_rel_size(Relation rel, int32 *attr_widths,
table_relation_estimate_size(rel, attr_widths, pages, tuples,
allvisfrac);
}
- else if (rel->rd_rel->relkind == RELKIND_INDEX)
+ else if (rel->rd_rel->relkind == RELKIND_INDEX ||
+ rel->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
{
/*
* XXX: It'd probably be good to move this into a callback, individual
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 2dddd8f302..a259079e8c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -491,7 +491,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
%type <str> unicode_normal_form
%type <boolean> opt_instead
-%type <boolean> opt_unique opt_verbose opt_full
+%type <boolean> opt_unique opt_verbose opt_full opt_global
%type <boolean> opt_freeze opt_analyze opt_default opt_recheck
%type <defelt> opt_binary copy_delimiter
@@ -7918,7 +7918,7 @@ defacl_privilege_target:
IndexStmt: CREATE opt_unique INDEX opt_concurrently opt_single_name
ON relation_expr access_method_clause '(' index_params ')'
- opt_include opt_unique_null_treatment opt_reloptions OptTableSpace where_clause
+ opt_include opt_unique_null_treatment opt_reloptions OptTableSpace where_clause opt_global
{
IndexStmt *n = makeNode(IndexStmt);
@@ -7933,6 +7933,7 @@ IndexStmt: CREATE opt_unique INDEX opt_concurrently opt_single_name
n->options = $14;
n->tableSpace = $15;
n->whereClause = $16;
+ n->global_index = $17;
n->excludeOpNames = NIL;
n->idxcomment = NULL;
n->indexOid = InvalidOid;
@@ -7950,7 +7951,7 @@ IndexStmt: CREATE opt_unique INDEX opt_concurrently opt_single_name
}
| CREATE opt_unique INDEX opt_concurrently IF_P NOT EXISTS name
ON relation_expr access_method_clause '(' index_params ')'
- opt_include opt_unique_null_treatment opt_reloptions OptTableSpace where_clause
+ opt_include opt_unique_null_treatment opt_reloptions OptTableSpace where_clause opt_global
{
IndexStmt *n = makeNode(IndexStmt);
@@ -7965,6 +7966,7 @@ IndexStmt: CREATE opt_unique INDEX opt_concurrently opt_single_name
n->options = $17;
n->tableSpace = $18;
n->whereClause = $19;
+ n->global_index = $20;
n->excludeOpNames = NIL;
n->idxcomment = NULL;
n->indexOid = InvalidOid;
@@ -7982,6 +7984,11 @@ IndexStmt: CREATE opt_unique INDEX opt_concurrently opt_single_name
}
;
+opt_global:
+ GLOBAL { $$ = true; }
+ | /*EMPTY*/ { $$ = false; }
+ ;
+
opt_unique:
UNIQUE { $$ = true; }
| /*EMPTY*/ { $$ = false; }
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 8140e79d8f..358e7df040 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3982,6 +3982,7 @@ transformPartitionCmd(CreateStmtContext *cxt, PartitionCmd *cmd)
RelationGetRelationName(parentRel))));
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
/* the index must be partitioned */
ereport(ERROR,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
diff --git a/src/backend/utils/adt/amutils.c b/src/backend/utils/adt/amutils.c
index 60fd396f24..faf810da2f 100644
--- a/src/backend/utils/adt/amutils.c
+++ b/src/backend/utils/adt/amutils.c
@@ -176,6 +176,7 @@ indexam_property(FunctionCallInfo fcinfo,
PG_RETURN_NULL();
rd_rel = (Form_pg_class) GETSTRUCT(tuple);
if (rd_rel->relkind != RELKIND_INDEX &&
+ rd_rel->relkind != RELKIND_GLOBAL_INDEX &&
rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
{
ReleaseSysCache(tuple);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index bd6cd4e47b..151df2920c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -480,6 +480,7 @@ RelationParseRelOptions(Relation relation, HeapTuple tuple)
amoptsfn = NULL;
break;
case RELKIND_INDEX:
+ case RELKIND_GLOBAL_INDEX:
case RELKIND_PARTITIONED_INDEX:
amoptsfn = relation->rd_indam->amoptions;
break;
@@ -1202,6 +1203,7 @@ retry:
* initialize access method information
*/
if (relation->rd_rel->relkind == RELKIND_INDEX ||
+ relation->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
RelationInitIndexAccessInfo(relation);
else if (RELKIND_HAS_TABLE_AM(relation->rd_rel->relkind) ||
@@ -2082,6 +2084,7 @@ RelationIdGetRelation(Oid relationId)
* a headache for indexes that reload itself depends on.
*/
if (rd->rd_rel->relkind == RELKIND_INDEX ||
+ rd->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
rd->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
RelationReloadIndexInfo(rd);
else
@@ -2222,6 +2225,7 @@ RelationReloadIndexInfo(Relation relation)
/* Should be called only for invalidated, live indexes */
Assert((relation->rd_rel->relkind == RELKIND_INDEX ||
+ relation->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
!relation->rd_isvalid &&
relation->rd_droppedSubid == InvalidSubTransactionId);
@@ -2349,7 +2353,8 @@ RelationReloadNailed(Relation relation)
if (!IsTransactionState() || relation->rd_refcnt <= 1)
return;
- if (relation->rd_rel->relkind == RELKIND_INDEX)
+ if (relation->rd_rel->relkind == RELKIND_INDEX ||
+ relation->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
{
/*
* If it's a nailed-but-not-mapped index, then we need to re-read the
@@ -2542,6 +2547,7 @@ RelationClearRelation(Relation relation, bool rebuild)
* index, and we check for pg_index updates too.
*/
if ((relation->rd_rel->relkind == RELKIND_INDEX ||
+ relation->rd_rel->relkind == RELKIND_GLOBAL_INDEX ||
relation->rd_rel->relkind == RELKIND_PARTITIONED_INDEX) &&
relation->rd_refcnt > 0 &&
relation->rd_indexcxt != NULL)
@@ -3714,7 +3720,8 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
newrelfilenumber = GetNewRelFileNumber(relation->rd_rel->reltablespace,
NULL, persistence);
}
- else if (relation->rd_rel->relkind == RELKIND_INDEX)
+ else if (relation->rd_rel->relkind == RELKIND_INDEX ||
+ relation->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
{
if (!OidIsValid(binary_upgrade_next_index_pg_class_relfilenumber))
ereport(ERROR,
@@ -6120,7 +6127,8 @@ load_relcache_init_file(bool shared)
* If it's an index, there's more to do. Note we explicitly ignore
* partitioned indexes here.
*/
- if (rel->rd_rel->relkind == RELKIND_INDEX)
+ if (rel->rd_rel->relkind == RELKIND_INDEX ||
+ rel->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
{
MemoryContext indexcxt;
Oid *opfamily;
@@ -6515,7 +6523,8 @@ write_relcache_init_file(bool shared)
* If it's an index, there's more to do. Note we explicitly ignore
* partitioned indexes here.
*/
- if (rel->rd_rel->relkind == RELKIND_INDEX)
+ if (rel->rd_rel->relkind == RELKIND_INDEX ||
+ rel->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
{
/* write the pg_index tuple */
/* we assume this was created by heap_copytuple! */
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2eae519b1d..149281167e 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1874,6 +1874,7 @@ describeOneTableDetails(const char *schemaname,
attgenerated_col = cols++;
}
if (tableinfo.relkind == RELKIND_INDEX ||
+ tableinfo.relkind == RELKIND_GLOBAL_INDEX ||
tableinfo.relkind == RELKIND_PARTITIONED_INDEX)
{
if (pset.sversion >= 110000)
@@ -1914,6 +1915,7 @@ describeOneTableDetails(const char *schemaname,
/* stats target, if relevant to relkind */
if (tableinfo.relkind == RELKIND_RELATION ||
tableinfo.relkind == RELKIND_INDEX ||
+ tableinfo.relkind == RELKIND_GLOBAL_INDEX ||
tableinfo.relkind == RELKIND_PARTITIONED_INDEX ||
tableinfo.relkind == RELKIND_MATVIEW ||
tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
@@ -1979,6 +1981,14 @@ describeOneTableDetails(const char *schemaname,
printfPQExpBuffer(&title, _("Index \"%s.%s\""),
schemaname, relationname);
break;
+ case RELKIND_GLOBAL_INDEX:
+ if (tableinfo.relpersistence == 'u')
+ printfPQExpBuffer(&title, _("Unlogged global index \"%s.%s\""),
+ schemaname, relationname);
+ else
+ printfPQExpBuffer(&title, _("Global index \"%s.%s\""),
+ schemaname, relationname);
+ break;
case RELKIND_PARTITIONED_INDEX:
if (tableinfo.relpersistence == 'u')
printfPQExpBuffer(&title, _("Unlogged partitioned index \"%s.%s\""),
@@ -2244,6 +2254,7 @@ describeOneTableDetails(const char *schemaname,
}
if (tableinfo.relkind == RELKIND_INDEX ||
+ tableinfo.relkind == RELKIND_GLOBAL_INDEX ||
tableinfo.relkind == RELKIND_PARTITIONED_INDEX)
{
/* Footer information about an index */
@@ -2344,7 +2355,7 @@ describeOneTableDetails(const char *schemaname,
/*
* If it's a partitioned index, we'll print the tablespace below
*/
- if (tableinfo.relkind == RELKIND_INDEX)
+ if (tableinfo.relkind == RELKIND_INDEX || tableinfo.relkind == RELKIND_GLOBAL_INDEX)
add_tablespace_footer(&cont, tableinfo.relkind,
tableinfo.tablespace, true);
}
@@ -3546,6 +3557,7 @@ add_tablespace_footer(printTableContent *const cont, char relkind,
if (relkind == RELKIND_RELATION ||
relkind == RELKIND_MATVIEW ||
relkind == RELKIND_INDEX ||
+ relkind == RELKIND_GLOBAL_INDEX ||
relkind == RELKIND_PARTITIONED_TABLE ||
relkind == RELKIND_PARTITIONED_INDEX ||
relkind == RELKIND_TOASTVALUE)
@@ -3869,6 +3881,7 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
" WHEN " CppAsString2(RELKIND_VIEW) " THEN '%s'"
" WHEN " CppAsString2(RELKIND_MATVIEW) " THEN '%s'"
" WHEN " CppAsString2(RELKIND_INDEX) " THEN '%s'"
+ " WHEN " CppAsString2(RELKIND_GLOBAL_INDEX) " THEN '%s'"
" WHEN " CppAsString2(RELKIND_SEQUENCE) " THEN '%s'"
" WHEN " CppAsString2(RELKIND_TOASTVALUE) " THEN '%s'"
" WHEN " CppAsString2(RELKIND_FOREIGN_TABLE) " THEN '%s'"
@@ -3882,6 +3895,7 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
gettext_noop("view"),
gettext_noop("materialized view"),
gettext_noop("index"),
+ gettext_noop("global index"),
gettext_noop("sequence"),
gettext_noop("TOAST table"),
gettext_noop("foreign table"),
@@ -3963,6 +3977,7 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
appendPQExpBufferStr(&buf, CppAsString2(RELKIND_MATVIEW) ",");
if (showIndexes)
appendPQExpBufferStr(&buf, CppAsString2(RELKIND_INDEX) ","
+ CppAsString2(RELKIND_GLOBAL_INDEX) ","
CppAsString2(RELKIND_PARTITIONED_INDEX) ",");
if (showSeq)
appendPQExpBufferStr(&buf, CppAsString2(RELKIND_SEQUENCE) ",");
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 91c28868d4..1ddfc9af21 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -65,6 +65,7 @@ extern void index_check_primary_key(Relation heapRel,
#define INDEX_CREATE_IF_NOT_EXISTS (1 << 4)
#define INDEX_CREATE_PARTITIONED (1 << 5)
#define INDEX_CREATE_INVALID (1 << 6)
+#define INDEX_CREATE_GLOBAL (1 << 7)
extern Oid index_create(Relation heapRelation,
const char *indexRelationName,
diff --git a/src/include/catalog/pg_class.h b/src/include/catalog/pg_class.h
index e1f4eefa22..eb190e8f14 100644
--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -168,6 +168,7 @@ DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, ClassTblspcRelfilenodeInd
#define RELKIND_FOREIGN_TABLE 'f' /* foreign table */
#define RELKIND_PARTITIONED_TABLE 'p' /* partitioned table */
#define RELKIND_PARTITIONED_INDEX 'I' /* partitioned index */
+#define RELKIND_GLOBAL_INDEX 'g' /* global index */
#define RELPERSISTENCE_PERMANENT 'p' /* regular table */
#define RELPERSISTENCE_UNLOGGED 'u' /* unlogged permanent table */
@@ -194,6 +195,7 @@ DECLARE_INDEX(pg_class_tblspc_relfilenode_index, 3455, ClassTblspcRelfilenodeInd
#define RELKIND_HAS_STORAGE(relkind) \
((relkind) == RELKIND_RELATION || \
(relkind) == RELKIND_INDEX || \
+ (relkind) == RELKIND_GLOBAL_INDEX || \
(relkind) == RELKIND_SEQUENCE || \
(relkind) == RELKIND_TOASTVALUE || \
(relkind) == RELKIND_MATVIEW)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7caff62af7..880dd6011a 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2991,6 +2991,7 @@ typedef struct IndexStmt
bool if_not_exists; /* just do nothing if index already exists? */
bool reset_default_tblspc; /* reset default_tablespace prior to
* executing */
+ bool global_index; /* true if index is global */
} IndexStmt;
/* ----------------------
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 1bdd430f06..eed20063c1 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1421,3 +1421,43 @@ Indexes:
"parted_index_col_drop11_b_idx" btree (b)
drop table parted_index_col_drop;
+-- create global index using non-partition key
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (10);
+create table gidxpart2 partition of gidxpart for values from (10) to (100);
+create unique index gidx_u on gidxpart using btree(b) global;
+select relname, relhasindex, relkind from pg_class where relname like '%gidx%' order by oid;
+ relname | relhasindex | relkind
+-----------------+-------------+---------
+ gidxpart | t | p
+ gidxpart1 | t | r
+ gidxpart2 | t | r
+ gidx_u | f | I
+ gidxpart1_b_idx | f | g
+ gidxpart2_b_idx | f | g
+(6 rows)
+
+\d+ gidxpart
+ Partitioned table "public.gidxpart"
+ Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
+--------+---------+-----------+----------+---------+----------+--------------+-------------
+ a | integer | | | | plain | |
+ b | integer | | | | plain | |
+ c | text | | | | extended | |
+Partition key: RANGE (a)
+Indexes:
+ "gidx_u" UNIQUE, btree (b)
+Partitions: gidxpart1 FOR VALUES FROM (0) TO (10),
+ gidxpart2 FOR VALUES FROM (10) TO (100)
+
+\d+ gidx_u
+ Partitioned index "public.gidx_u"
+ Column | Type | Key? | Definition | Storage | Stats target
+--------+---------+------+------------+---------+--------------
+ b | integer | yes | b | plain |
+unique, btree, for table "public.gidxpart"
+Partitions: gidxpart1_b_idx,
+ gidxpart2_b_idx
+
+drop index gidx_u;
+drop table gidxpart;
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index 429120e710..2169f28e69 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -760,3 +760,14 @@ alter table parted_index_col_drop drop column c;
\d parted_index_col_drop2
\d parted_index_col_drop11
drop table parted_index_col_drop;
+
+-- create global index using non-partition key
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (10);
+create table gidxpart2 partition of gidxpart for values from (10) to (100);
+create unique index gidx_u on gidxpart using btree(b) global;
+select relname, relhasindex, relkind from pg_class where relname like '%gidx%' order by oid;
+\d+ gidxpart
+\d+ gidx_u
+drop index gidx_u;
+drop table gidxpart;
--
2.17.1
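For reference, the end-to-end behavior that the syntax added above is intended to provide (once the later patches in this series are also applied) can be sketched as follows; the table and index names are illustrative only and are not part of the regression suite:

```sql
-- assumes all four patches in this series are applied
create table gdemo (a int, b int) partition by range (a);
create table gdemo1 partition of gdemo for values from (0) to (10);
create table gdemo2 partition of gdemo for values from (10) to (20);

-- unique on b, a non-partition key, enforced across all partitions
create unique index gdemo_b_gidx on gdemo using btree (b) global;

insert into gdemo values (1, 42);   -- routed to gdemo1
insert into gdemo values (15, 42);  -- routed to gdemo2; expected to fail
                                    -- with a unique-constraint violation
```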
Attachment: 0002-support-global-unique-index-create.patch
From 7d6ad6345b97d6b9588dcdab39fab614c8cb6ece Mon Sep 17 00:00:00 2001
From: David Zhang <david.zhang@highgo.ca>
Date: Thu, 17 Nov 2022 12:22:16 -0800
Subject: [PATCH 2/4] support global unique index create
---
src/backend/access/nbtree/nbtsort.c | 364 ++++++++++++++++++++++-
src/backend/commands/indexcmds.c | 158 ++++++++++
src/backend/storage/file/sharedfileset.c | 10 +
src/backend/utils/sort/tuplesort.c | 55 +++-
src/include/commands/defrem.h | 2 +
src/include/nodes/execnodes.h | 5 +
src/include/nodes/parsenodes.h | 4 +
src/include/storage/sharedfileset.h | 1 +
src/include/utils/tuplesort.h | 10 +
src/test/regress/expected/indexing.out | 12 +
src/test/regress/sql/indexing.sql | 12 +
11 files changed, 622 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 501e011ce1..deb399e36a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -71,6 +71,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_TUPLESORT_GLOBAL UINT64CONST(0xA000000000000007)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -110,6 +111,7 @@ typedef struct BTShared
bool nulls_not_distinct;
bool isconcurrent;
int scantuplesortstates;
+ bool isglobal;
/*
* workersdonecv is used to monitor the progress of workers. All parallel
@@ -197,6 +199,7 @@ typedef struct BTLeader
Snapshot snapshot;
WalUsage *walusage;
BufferUsage *bufferusage;
+ Sharedsort *sharedsortglobal;
} BTLeader;
/*
@@ -226,6 +229,16 @@ typedef struct BTBuildState
* BTBuildState. Workers have their own spool and spool2, though.)
*/
BTLeader *btleader;
+
+ /*
+ * global unique index related parameters
+ */
+ BTSpool *spoolglobal; /* spoolglobal is used on global unique index
+ * build in parallel */
+ bool global_index; /* true if index is global */
+ int globalIndexPart; /* partition number indication */
+ int nparts; /* number of partitions involved in global
+ * unique index build in parallel */
} BTBuildState;
/*
@@ -291,8 +304,29 @@ static void _bt_leader_participate_as_worker(BTBuildState *buildstate);
static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
BTShared *btshared, Sharedsort *sharedsort,
Sharedsort *sharedsort2, int sortmem,
- bool progress);
+ bool progress, Sharedsort *sharedsortglobal,
+ bool isworker);
+static BTSpool *global_btspool;
+Sharedsort *global_sharedsort;
+
+static void
+btinit_global_spool(Relation heap, Relation index, BTBuildState *buildstate)
+{
+ elog(DEBUG2, "%s: init global index spool", __func__);
+ global_btspool = (BTSpool *) palloc0(sizeof(BTSpool));
+ global_btspool->heap = heap;
+ global_btspool->index = index;
+ global_btspool->isunique = buildstate->isunique;
+
+ global_btspool->sortstate =
+ tuplesort_begin_index_btree(heap, index, buildstate->isunique,
+ buildstate->nulls_not_distinct,
+ maintenance_work_mem, NULL,
+ false);
+
+ tuplesort_mark_global_sort(global_btspool->sortstate);
+}
/*
* btbuild() -- build a new btree index.
@@ -318,22 +352,149 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
buildstate.indtuples = 0;
buildstate.btleader = NULL;
+ if (indexInfo->ii_Global_index && indexInfo->ii_Unique)
+ {
+ /* copy global unique index related parameters to buildstate */
+ buildstate.global_index = indexInfo->ii_Global_index;
+ buildstate.globalIndexPart = indexInfo->ii_GlobalIndexPart;
+ buildstate.nparts = indexInfo->ii_Nparts;
+ }
+ else
+ {
+ /* disable global unique check */
+ buildstate.global_index = false;
+ buildstate.globalIndexPart = 0;
+ buildstate.nparts = 0;
+ }
+
/*
* We expect to be called exactly once for any index relation. If that's
- * not the case, big trouble's what we have.
+ * not the case, big trouble's what we have. The exception is when
+ * ii_BuildGlobalSpool is set, in which case we only build the global
+ * spool for the cross-partition uniqueness check and create no
+ * physical index tuples
*/
- if (RelationGetNumberOfBlocks(index) != 0)
+ if (indexInfo->ii_BuildGlobalSpool == false && RelationGetNumberOfBlocks(index) != 0)
elog(ERROR, "index \"%s\" already contains data",
RelationGetRelationName(index));
reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+ if (indexInfo->ii_ParallelWorkers > 0)
+ {
+ /*
+ * global uniqueness check in parallel build case
+ */
+ if (indexInfo->ii_Global_index && indexInfo->ii_Unique && global_btspool)
+ {
+ /*
+ * indexInfo->ii_GlobalIndexPart <= 0 indicates the first or an
+ * intermediate partition to build the index on. For a parallel global
+ * unique index build, we need to clean up the global_btspool
+ * structure to prevent a resource leak
+ */
+ if (indexInfo->ii_GlobalIndexPart <= 0)
+ {
+ _bt_spooldestroy(global_btspool);
+ global_btspool = NULL;
+ }
+
+ /*
+ * indexInfo->ii_GlobalIndexPart > 0 indicates the last partition
+ * to build index on. For parallel global unique index build, we
+ * need to call tuplesort_performsort to merge all the tapes
+ * created from previous partition build runs and do a final
+ * sorting to determine global uniqueness
+ */
+ if (indexInfo->ii_GlobalIndexPart > 0)
+ {
+ IndexTuple itup;
+
+ elog(DEBUG2, "last partition to build global index in parallel. Perform merge run and "
+ "uniqueness check now...");
+ tuplesort_performsort(buildstate.spoolglobal->sortstate);
+
+ /*
+ * this loop checks for uniqueness after all tapes have been
+ * merged. If a duplicate is found, we will error out.
+ */
+ while ((itup = tuplesort_getindextuple(buildstate.spoolglobal->sortstate,
+ true)) != NULL)
+ {
+ /*
+ * simply checking for global uniqueness, nothing to do
+ * here
+ */
+ }
+
+ /*
+ * no global uniqueness violation is found at this point,
+ * remove all the tapes (temp files) and destroy resources and
+ * continue to build the actual index.
+ */
+ _bt_spooldestroy(buildstate.spoolglobal);
+ _bt_spooldestroy(global_btspool);
+ global_btspool = NULL;
+ if (global_sharedsort)
+ {
+ pfree(global_sharedsort);
+ global_sharedsort = NULL;
+ }
+ }
+ }
+ }
+ else
+ {
+ /*
+ * global uniqueness check in serial build case
+ */
+ if (indexInfo->ii_GlobalIndexPart > 0 && indexInfo->ii_Global_index &&
+ indexInfo->ii_Unique && global_btspool)
+ {
+ /*
+ * indexInfo->ii_GlobalIndexPart > 0 indicates the last partition
+ * to build index on. For serial global unique index build, we
+ * call tuplesort_performsort on global_btspool->sortstate which
+ * should contain index tuples from all partitions. If a duplicate
+ * is found, we will error out.
+ */
+ elog(DEBUG2, "last partition to build global index serially. Sorting global_btspool for "
+ "uniqueness check now...");
+ tuplesort_performsort(global_btspool->sortstate);
+
+ /*
+ * no global uniqueness violation is found at this point, destroy
+ * global_btspool structure and continue to build the actual
+ * index.
+ */
+ _bt_spooldestroy(global_btspool);
+ global_btspool = NULL;
+ }
+ }
+
/*
- * Finish the build by (1) completing the sort of the spool file, (2)
- * inserting the sorted tuples into btree pages and (3) building the upper
- * levels. Finally, it may also be necessary to end use of parallelism.
+ * if indexInfo->ii_BuildGlobalSpool is set, we will not continue to build
+ * the actual index. This is used during a new partition attach, where we
+ * just want to populate the global_btspool from current partitions
+ * without building the actual indexes (because they exist already). Then,
+ * we take that global_btspool and sort it together with the tuples in
+ * the newly attached partition to determine whether the attach would
+ * violate the global uniqueness check
*/
- _bt_leafbuild(buildstate.spool, buildstate.spool2);
+ if (indexInfo->ii_BuildGlobalSpool)
+ {
+ elog(DEBUG2, "ii_BuildGlobalSpool is set. Skip building actual index content");
+ }
+ else
+ {
+ /*
+ * Finish the build by (1) completing the sort of the spool file, (2)
+ * inserting the sorted tuples into btree pages and (3) building the
+ * upper levels. Finally, it may also be necessary to end use of
+ * parallelism.
+ */
+ _bt_leafbuild(buildstate.spool, buildstate.spool2);
+ }
+
_bt_spooldestroy(buildstate.spool);
if (buildstate.spool2)
_bt_spooldestroy(buildstate.spool2);
@@ -374,6 +535,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
BTSpool *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
SortCoordinate coordinate = NULL;
double reltuples = 0;
+ SortCoordinate coordinateglobal = NULL;
/*
* We size the sort area as maintenance_work_mem rather than work_mem to
@@ -395,8 +557,40 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
/* Attempt to launch parallel worker scan when required */
if (indexInfo->ii_ParallelWorkers > 0)
+ {
+ /*
+ * while building the last partition table in parallel global unique
+ * index build, we need to allocate a primary spoolglobal, which will
+ * be used in the leader process later to "take over" all tapes
+ * created by previous partition runs
+ */
+ if (buildstate->isunique && buildstate->global_index &&
+ buildstate->globalIndexPart > 0)
+ {
+ elog(DEBUG2, "init primary spoolglobal in last partition for parallel index build");
+ buildstate->spoolglobal = (BTSpool *) palloc0(sizeof(BTSpool));
+ buildstate->spoolglobal->heap = heap;
+ buildstate->spoolglobal->index = index;
+ buildstate->spoolglobal->isunique = indexInfo->ii_Unique;
+ }
_bt_begin_parallel(buildstate, indexInfo->ii_Concurrent,
indexInfo->ii_ParallelWorkers);
+ }
+ else
+ {
+ /*
+ * in serial global unique index build, we just need to allocate a
+ * single global_btspool structure at the first partition build. The
+ * rest of the partitions will add their index tuples to this single
+ * global spool structure
+ */
+ if (buildstate->isunique && buildstate->global_index &&
+ buildstate->globalIndexPart < 0)
+ {
+ elog(DEBUG2, "init new global_btspool for serial index build");
+ btinit_global_spool(heap, index, buildstate);
+ }
+ }
/*
* If parallel build requested and at least one worker process was
@@ -409,6 +603,22 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
coordinate->nParticipants =
buildstate->btleader->nparticipanttuplesorts;
coordinate->sharedsort = buildstate->btleader->sharedsort;
+
+ /*
+ * set up a coordinator state for primary spoolglobal if we are doing
+ * global unique index build in parallel. We do this at the last
+ * partition create in parallel mode
+ */
+ if (buildstate->isunique && buildstate->global_index && buildstate->globalIndexPart > 0)
+ {
+ elog(DEBUG2, "set up coordinate state for primary spoolglobal for "
+ "parallel index build case");
+ coordinateglobal = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+ coordinateglobal->isWorker = false;
+ coordinateglobal->nParticipants =
+ tuplesort_get_curr_workers(buildstate->btleader->sharedsortglobal);
+ coordinateglobal->sharedsort = buildstate->btleader->sharedsortglobal;
+ }
}
/*
@@ -438,6 +648,23 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);
+ /*
+ * initialize primary spoolglobal if we are doing global unique index
+ * build in parallel. We do this at the last partition create in parallel
+ * mode
+ */
+ if (buildstate->btleader && buildstate->isunique &&
+ buildstate->global_index && buildstate->globalIndexPart > 0)
+ {
+ elog(DEBUG2, "tuplesort_begin_index_btree for primary spoolglobal for "
+ "parallel index build case");
+ buildstate->spoolglobal->sortstate =
+ tuplesort_begin_index_btree(heap, index, buildstate->isunique,
+ buildstate->nulls_not_distinct,
+ maintenance_work_mem, coordinateglobal,
+ false);
+ }
+
/*
* If building a unique index, put dead tuples in a second spool to keep
* them out of the uniqueness check. We expect that the second spool (for
@@ -487,6 +714,19 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
reltuples = _bt_parallel_heapscan(buildstate,
&indexInfo->ii_BrokenHotChain);
+ /*
+ * all parallel workers should have finished at this point; make a
+ * backup of sharedsortglobal, which is needed to persist the logical
+ * tape and temp file information for the next partition build when
+ * building a global unique index in parallel
+ */
+ if (buildstate->btleader && buildstate->global_index &&
+ buildstate->isunique)
+ {
+ elog(DEBUG2, "all workers finished, backup sharedsortglobal");
+ tuplesort_copy_sharedsort2(global_sharedsort, global_btspool->sortstate);
+ }
+
/*
* Set the progress target for the next phase. Reset the block number
* values set by table_index_build_scan
@@ -599,7 +839,11 @@ _bt_build_callback(Relation index,
* processing
*/
if (tupleIsAlive || buildstate->spool2 == NULL)
+ {
_bt_spool(buildstate->spool, tid, values, isnull);
+ if (buildstate->global_index && buildstate->isunique && global_btspool)
+ _bt_spool(global_btspool, tid, values, isnull);
+ }
else
{
/* dead tuples are put into spool2 */
@@ -1458,9 +1702,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Snapshot snapshot;
Size estbtshared;
Size estsort;
+ Size estsortglobal = 0;
BTShared *btshared;
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
+ Sharedsort *sharedsortglobal;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = (BTLeader *) palloc0(sizeof(BTLeader));
WalUsage *walusage;
@@ -1504,6 +1750,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
estsort = tuplesort_estimate_shared(scantuplesortstates);
shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+ if (buildstate->isunique && buildstate->global_index)
+ {
+ /* global unique index case will estimate one more Sharedsort struct */
+ estsortglobal = tuplesort_estimate_shared(scantuplesortstates * buildstate->nparts);
+ shm_toc_estimate_chunk(&pcxt->estimator, estsortglobal);
+ }
+
/*
* Unique case requires a second spool, and so we may have to account for
* another shared workspace for that -- PARALLEL_KEY_TUPLESORT_SPOOL2
@@ -1571,6 +1824,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btshared->havedead = false;
btshared->indtuples = 0.0;
btshared->brokenhotchain = false;
+ btshared->isglobal = buildstate->global_index;
table_parallelscan_initialize(btspool->heap,
ParallelTableScanFromBTShared(btshared),
snapshot);
@@ -1603,6 +1857,43 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT_SPOOL2, sharedsort2);
}
+ if (buildstate->isunique && buildstate->global_index)
+ {
+ /* global unique index case will allocate one more Sharedsort struct */
+ sharedsortglobal = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsortglobal);
+ if (!sharedsortglobal)
+ elog(ERROR, "failed to allocate shared memory space");
+
+ if (buildstate->globalIndexPart == -1)
+ {
+ elog(DEBUG2, "initialize and make a copy of sharedsortglobal for the first time");
+ tuplesort_initialize_shared(sharedsortglobal, scantuplesortstates * buildstate->nparts, NULL);
+
+ /* save a copy of sharedsortglobal */
+ global_sharedsort = (Sharedsort *) palloc(estsortglobal);
+ tuplesort_copy_sharedsort(global_sharedsort, sharedsortglobal);
+ }
+ else if (buildstate->globalIndexPart == 0 || buildstate->globalIndexPart == 1)
+ {
+ elog(DEBUG2, "restore the copy of sharedsortglobal for subsequent processing");
+ tuplesort_copy_sharedsort(sharedsortglobal, global_sharedsort);
+
+ /* register for cleanup at the last partition index build */
+ if (buildstate->globalIndexPart == 1)
+ {
+ tuplesort_register_cleanup_callback(sharedsortglobal, pcxt->seg);
+ }
+ }
+ else
+ {
+ elog(ERROR, "invalid global index partition value %d", buildstate->globalIndexPart);
+ }
+
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT_GLOBAL, sharedsortglobal);
+ }
+ else
+ sharedsortglobal = NULL;
+
/* Store query string for workers */
if (debug_query_string)
{
@@ -1636,6 +1927,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->snapshot = snapshot;
btleader->walusage = walusage;
btleader->bufferusage = bufferusage;
+ btleader->sharedsortglobal = sharedsortglobal;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1780,7 +2072,8 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
/* Perform work common to all participants */
_bt_parallel_scan_and_sort(leaderworker, leaderworker2, btleader->btshared,
btleader->sharedsort, btleader->sharedsort2,
- sortmem, true);
+ sortmem, true, btleader->sharedsortglobal,
+ false);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
@@ -1803,6 +2096,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
BTShared *btshared;
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
+ Sharedsort *sharedsortglobal;
Relation heapRel;
Relation indexRel;
LOCKMODE heapLockmode;
@@ -1878,13 +2172,26 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort2, seg);
}
+ if (btshared->isunique && btshared->isglobal)
+ {
+ sharedsortglobal = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT_GLOBAL, false);
+ tuplesort_attach_shared(sharedsortglobal, seg);
+ elog(DEBUG2, "worker %d processing global unique index", MyProcPid);
+ }
+ else
+ {
+ sharedsortglobal = NULL;
+ elog(DEBUG2, "worker %d processing regular index", MyProcPid);
+ }
+
/* Prepare to track buffer usage during parallel execution */
InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
_bt_parallel_scan_and_sort(btspool, btspool2, btshared, sharedsort,
- sharedsort2, sortmem, false);
+ sharedsort2, sortmem, false,
+ sharedsortglobal, true);
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
@@ -1919,7 +2226,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
static void
_bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
BTShared *btshared, Sharedsort *sharedsort,
- Sharedsort *sharedsort2, int sortmem, bool progress)
+ Sharedsort *sharedsort2, int sortmem, bool progress,
+ Sharedsort *sharedsortglobal, bool isworker)
{
SortCoordinate coordinate;
BTBuildState buildstate;
@@ -1965,6 +2273,28 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
false);
}
+
+ /* global index */
+ if (sharedsortglobal)
+ {
+ SortCoordinate coordinate3;
+
+ global_btspool = (BTSpool *) palloc0(sizeof(BTSpool));
+ global_btspool->heap = btspool->heap;
+ global_btspool->index = btspool->index;
+ global_btspool->isunique = btspool->isunique;
+
+ coordinate3 = palloc0(sizeof(SortCoordinateData));
+ coordinate3->isWorker = true;
+ coordinate3->nParticipants = -1;
+ coordinate3->sharedsort = sharedsortglobal;
+ global_btspool->sortstate =
+ tuplesort_begin_index_btree(global_btspool->heap, global_btspool->index, global_btspool->isunique,
+ btspool->nulls_not_distinct, sortmem, coordinate3,
+ false);
+ }
+
+
/* Fill in buildstate for _bt_build_callback() */
buildstate.isunique = btshared->isunique;
buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
@@ -1975,6 +2305,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
buildstate.indtuples = 0;
buildstate.btleader = NULL;
+ if (btshared->isglobal && btshared->isunique)
+ {
+ /* fill global unique index related parameters in buildstate */
+ buildstate.global_index = btshared->isglobal;
+ }
+
/* Join parallel scan */
indexInfo = BuildIndexInfo(btspool->index);
indexInfo->ii_Concurrent = btshared->isconcurrent;
@@ -1997,6 +2333,11 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
tuplesort_performsort(btspool2->sortstate);
}
+ if (global_btspool)
+ {
+ tuplesort_performsort(global_btspool->sortstate);
+ }
+
/*
* Done. Record ambuild statistics, and whether we encountered a broken
* HOT chain.
@@ -2018,4 +2359,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
tuplesort_end(btspool->sortstate);
if (btspool2)
tuplesort_end(btspool2->sortstate);
+
+ if (global_btspool && isworker)
+ tuplesort_end(global_btspool->sortstate);
}
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 099bf68e77..51f6b46c99 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1145,6 +1145,13 @@ DefineIndex(Oid relationId,
flags |= INDEX_CREATE_PARTITIONED;
if (stmt->primary)
flags |= INDEX_CREATE_IS_PRIMARY;
+ /* copy the partition indicator: -1 = first, 0 = N/A, 1 = last */
+ if (stmt->global_index)
+ {
+ indexInfo->ii_Global_index = stmt->global_index;
+ indexInfo->ii_GlobalIndexPart = stmt->globalIndexPart;
+ indexInfo->ii_Nparts = stmt->nparts;
+ }
/*
* If the table is partitioned, and recursion was declined but partitions
@@ -1380,6 +1387,24 @@ DefineIndex(Oid relationId,
bool found_whole_row;
ListCell *lc;
+ if (i == nparts - 1 && stmt->global_index)
+ {
+ elog(DEBUG2, "mark as last partition to scan");
+ childStmt->globalIndexPart = 1;
+ }
+
+ if (i == 0 && stmt->global_index)
+ {
+ elog(DEBUG2, "mark as first partition to scan");
+ childStmt->globalIndexPart = -1;
+ }
+
+ if (stmt->global_index)
+ {
+ childStmt->nparts = nparts;
+ elog(DEBUG2, "total partitions to build global index %d", childStmt->nparts);
+ }
+
/*
* We can't use the same index name for the child index,
* so clear idxname to let the recursive invocation choose
@@ -4367,3 +4392,136 @@ set_indexsafe_procflags(void)
ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
LWLockRelease(ProcArrayLock);
}
+
+bool
+PopulateGlobalSpool(Relation idxRel, Relation heapRel, IndexStmt *stmt)
+{
+ IndexInfo *idxinfo;
+ int numberOfKeyAttributes;
+ int numberOfAttributes;
+ List *allIndexParams;
+ Oid accessMethodId;
+ Form_pg_am accessMethodForm;
+ char *accessMethodName;
+ HeapTuple tuple;
+ bool concurrent;
+ Oid *typeObjectId;
+ Oid *collationObjectId;
+ Oid *classObjectId;
+ int16 *coloptions;
+ IndexAmRoutine *amRoutine;
+ bool amcanorder;
+ Oid root_save_userid;
+ int root_save_sec_context;
+ int root_save_nestlevel;
+
+ root_save_nestlevel = NewGUCNestLevel();
+
+ if (stmt->concurrent && get_rel_persistence(RelationGetRelid(heapRel)) != RELPERSISTENCE_TEMP)
+ concurrent = true;
+ else
+ concurrent = false;
+
+ allIndexParams = list_concat_copy(stmt->indexParams,
+ stmt->indexIncludingParams);
+
+ numberOfKeyAttributes = list_length(stmt->indexParams);
+ numberOfAttributes = list_length(allIndexParams);
+
+ accessMethodName = stmt->accessMethod;
+ tuple = SearchSysCache1(AMNAME, PointerGetDatum(accessMethodName));
+ if (!HeapTupleIsValid(tuple))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("access method \"%s\" does not exist",
+ accessMethodName)));
+ }
+
+ accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
+ accessMethodId = accessMethodForm->oid;
+ amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
+
+ GetUserIdAndSecContext(&root_save_userid, &root_save_sec_context);
+ SetUserIdAndSecContext(heapRel->rd_rel->relowner,
+ root_save_sec_context | SECURITY_RESTRICTED_OPERATION);
+
+ idxinfo = makeIndexInfo(numberOfAttributes,
+ numberOfKeyAttributes,
+ accessMethodId,
+ NIL, /* expressions, NIL for now */
+ make_ands_implicit((Expr *) stmt->whereClause),
+ stmt->unique,
+ stmt->nulls_not_distinct,
+ !concurrent,
+ concurrent);
+
+ typeObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid));
+ collationObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid));
+ classObjectId = (Oid *) palloc(numberOfAttributes * sizeof(Oid));
+ coloptions = (int16 *) palloc(numberOfAttributes * sizeof(int16));
+ amcanorder = amRoutine->amcanorder;
+
+ pfree(amRoutine);
+ ReleaseSysCache(tuple);
+
+ ComputeIndexAttrs(idxinfo,
+ typeObjectId, collationObjectId, classObjectId,
+ coloptions, allIndexParams,
+ stmt->excludeOpNames, RelationGetRelid(heapRel),
+ accessMethodName, accessMethodId,
+ amcanorder, stmt->isconstraint, root_save_userid,
+ root_save_sec_context, &root_save_nestlevel);
+
+ /* Fill global unique index related parameters */
+ idxinfo->ii_GlobalIndexPart = stmt->globalIndexPart;
+ idxinfo->ii_BuildGlobalSpool = true;
+ idxinfo->ii_Nparts = stmt->nparts;
+ idxinfo->ii_Global_index = stmt->global_index;
+
+ /*
+ * Determine worker process details for parallel CREATE INDEX. Currently,
+ * only btree has support for parallel builds.
+ */
+ if (IsNormalProcessingMode() && idxRel->rd_rel->relam == BTREE_AM_OID)
+ {
+ idxinfo->ii_ParallelWorkers =
+ plan_create_index_workers(RelationGetRelid(heapRel),
+ RelationGetRelid(idxRel));
+ }
+
+ idxRel->rd_indam->ambuild(heapRel, idxRel,
+ idxinfo);
+
+ return true;
+}
+
+void
+ChangeRelKind(Relation idxRel, char kind)
+{
+ Relation pg_class;
+ HeapTuple tuple;
+ Form_pg_class classform;
+
+ /*
+ * Get a writable copy of the pg_class tuple for the given relation.
+ */
+ pg_class = table_open(RelationRelationId, RowExclusiveLock);
+
+ tuple = SearchSysCacheCopy1(RELOID,
+ ObjectIdGetDatum(RelationGetRelid(idxRel)));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "could not find tuple for relation %u",
+ RelationGetRelid(idxRel));
+
+ classform = (Form_pg_class) GETSTRUCT(tuple);
+
+ classform->relkind = kind;
+ idxRel->rd_rel->relkind = kind;
+
+ CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
+
+ heap_freetuple(tuple);
+
+ table_close(pg_class, RowExclusiveLock);
+}
diff --git a/src/backend/storage/file/sharedfileset.c b/src/backend/storage/file/sharedfileset.c
index 0782f50ba6..d82d1abcc6 100644
--- a/src/backend/storage/file/sharedfileset.c
+++ b/src/backend/storage/file/sharedfileset.c
@@ -91,6 +91,16 @@ SharedFileSetDeleteAll(SharedFileSet *fileset)
FileSetDeleteAll(&fileset->fs);
}
+/*
+ * Register a cleanup callback for an already-initialized fileset.
+ */
+void
+SharedFileSetRegisterCleanupCallback(SharedFileSet *fileset, dsm_segment *seg)
+{
+ /* Register our cleanup callback. */
+ if (seg)
+ on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
+}
/*
* Callback function that will be invoked when this backend detaches from a
* DSM segment holding a SharedFileSet that it has created or attached to. If
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 416f02ba3c..ef613e2fe2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -328,6 +328,7 @@ struct Tuplesortstate
*/
int64 abbrevNext; /* Tuple # at which to next check
* applicability */
+ bool isglobalsort;
/*
* Resource snapshot for time of sort start.
@@ -2190,7 +2191,14 @@ mergeruns(Tuplesortstate *state)
/* Tell logtape.c we won't be writing anymore */
LogicalTapeSetForgetFreeSpace(state->tapeset);
/* Initialize for the final merge pass */
- beginmerge(state);
+ if (state->isglobalsort)
+ {
+ elog(DEBUG2, "global unique index final merge run...");
+ mergeonerun(state);
+ }
+ else
+ beginmerge(state);
+
state->status = TSS_FINALMERGE;
return;
}
@@ -3002,6 +3010,51 @@ tuplesort_attach_shared(Sharedsort *shared, dsm_segment *seg)
SharedFileSetAttach(&shared->fileset, seg);
}
+void
+tuplesort_mark_global_sort(Tuplesortstate *state)
+{
+ state->isglobalsort = true;
+}
+
+void
+tuplesort_copy_sharedsort(Sharedsort *shared1, Sharedsort *shared2)
+{
+ if (!shared1 || !shared2)
+ elog(ERROR, "%s: cannot do sharedsort copy due to bad input", __FUNCTION__);
+
+ shared1->currentWorker = shared2->currentWorker;
+ shared1->mutex = shared2->mutex;
+ shared1->nTapes = shared2->nTapes;
+ shared1->fileset = shared2->fileset;
+ shared1->workersFinished = shared2->workersFinished;
+ memcpy(shared1->tapes, shared2->tapes, sizeof(TapeShare) * shared2->nTapes);
+}
+
+void
+tuplesort_copy_sharedsort2(Sharedsort *shared1, Tuplesortstate *state)
+{
+ if (!shared1 || !state)
+ elog(ERROR, "%s: cannot do sharedsort copy due to bad input", __FUNCTION__);
+
+ shared1->currentWorker = state->shared->currentWorker;
+ shared1->mutex = state->shared->mutex;
+ shared1->nTapes = state->shared->nTapes;
+ shared1->fileset = state->shared->fileset;
+ shared1->workersFinished = state->shared->workersFinished;
+ memcpy(shared1->tapes, state->shared->tapes, sizeof(TapeShare) * state->shared->nTapes);
+}
+
+int
+tuplesort_get_curr_workers(Sharedsort *shared)
+{
+ return shared->currentWorker;
+}
+
+void tuplesort_register_cleanup_callback(Sharedsort *shared, dsm_segment *seg)
+{
+ SharedFileSetRegisterCleanupCallback(&shared->fileset, seg);
+}
+
/*
* worker_get_identifier - Assign and return ordinal identifier for worker
*
diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h
index 1d3ce246c9..1593357d5d 100644
--- a/src/include/commands/defrem.h
+++ b/src/include/commands/defrem.h
@@ -47,6 +47,8 @@ extern bool CheckIndexCompatible(Oid oldId,
extern Oid GetDefaultOpClass(Oid type_id, Oid am_id);
extern Oid ResolveOpClass(List *opclass, Oid attrType,
const char *accessMethodName, Oid accessMethodId);
+extern bool PopulateGlobalSpool(Relation ixsRel, Relation heapRel, IndexStmt *stmt);
+extern void ChangeRelKind(Relation idxRel, char kind);
/* commands/functioncmds.c */
extern ObjectAddress CreateFunction(ParseState *pstate, CreateFunctionStmt *stmt);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 01b1727fc0..c2c8039d3d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -197,6 +197,11 @@ typedef struct IndexInfo
int ii_ParallelWorkers;
Oid ii_Am;
void *ii_AmCache;
+ bool ii_Global_index; /* true if index is global */
+ int ii_GlobalIndexPart; /* partition number indication */
+ bool ii_BuildGlobalSpool; /* indicate to build global spool only */
+ int ii_Nparts; /* num partitions for global index build in
+ * parallel */
MemoryContext ii_Context;
} IndexInfo;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 880dd6011a..8febb506f0 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2992,6 +2992,10 @@ typedef struct IndexStmt
bool reset_default_tblspc; /* reset default_tablespace prior to
* executing */
bool global_index; /* true if index is global */
+ int globalIndexPart; /* partition number indication */
+ int nparts; /* num partitions for global index build in
+ * parallel */
+
} IndexStmt;
/* ----------------------
diff --git a/src/include/storage/sharedfileset.h b/src/include/storage/sharedfileset.h
index b1cde36d0b..50580d5140 100644
--- a/src/include/storage/sharedfileset.h
+++ b/src/include/storage/sharedfileset.h
@@ -33,5 +33,6 @@ typedef struct SharedFileSet
extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);
extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);
extern void SharedFileSetDeleteAll(SharedFileSet *fileset);
+extern void SharedFileSetRegisterCleanupCallback(SharedFileSet *fileset, dsm_segment *seg);
#endif
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 15f2a4a795..ccc389c1f4 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -247,6 +247,16 @@ typedef struct
elog(ERROR, "unexpected end of data"); \
} while(0)
+extern void tuplesort_mark_global_sort(Tuplesortstate *state);
+
+extern void tuplesort_copy_sharedsort(Sharedsort *shared1, Sharedsort *shared2);
+
+extern void tuplesort_copy_sharedsort2(Sharedsort *shared1, Tuplesortstate *state);
+
+extern int tuplesort_get_curr_workers(Sharedsort *shared);
+
+extern void tuplesort_register_cleanup_callback(Sharedsort *shared, dsm_segment *seg);
+
/*
* We provide multiple interfaces to what is essentially the same code,
* since different callers have different data to be sorted and want to
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index eed20063c1..30a3ce0f20 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1461,3 +1461,15 @@ Partitions: gidxpart1_b_idx,
drop index gidx_u;
drop table gidxpart;
+-- Test the cross-partition uniqueness with non-partition key with global unique index
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (100000);
+create table gidxpart2 partition of gidxpart for values from (100000) to (199999);
+insert into gidxpart (a, b, c) values (42, 572814, 'inserted first on gidxpart1');
+insert into gidxpart (a, b, c) values (150000, 572814, 'inserted second on gidxpart2');
+create unique index on gidxpart (b) global; -- should fail
+ERROR: could not create unique index "gidxpart1_b_idx"
+DETAIL: Key (b)=(572814) is duplicated.
+delete from gidxpart where a = 150000 and b = 572814;
+create unique index on gidxpart (b) global;
+drop table gidxpart;
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index 2169f28e69..84dde4df93 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -771,3 +771,15 @@ select relname, relhasindex, relkind from pg_class where relname like '%gidx%' o
\d+ gidx_u
drop index gidx_u;
drop table gidxpart;
+
+-- Test the cross-partition uniqueness with non-partition key with global unique index
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (100000);
+create table gidxpart2 partition of gidxpart for values from (100000) to (199999);
+insert into gidxpart (a, b, c) values (42, 572814, 'inserted first on gidxpart1');
+insert into gidxpart (a, b, c) values (150000, 572814, 'inserted second on gidxpart2');
+create unique index on gidxpart (b) global; -- should fail
+delete from gidxpart where a = 150000 and b = 572814;
+create unique index on gidxpart (b) global;
+drop table gidxpart;
+
--
2.17.1
Attachment: 0003-support-global-unique-index-attach-and-detach.patch
From f960543a0d1c10f9ccbb9bb19b994b47d5929365 Mon Sep 17 00:00:00 2001
From: David Zhang <david.zhang@highgo.ca>
Date: Thu, 17 Nov 2022 12:24:35 -0800
Subject: [PATCH 3/4] support global unique index attach and detach
---
src/backend/commands/tablecmds.c | 214 +++++++++++++++++++++++++
src/test/regress/expected/indexing.out | 55 +++++++
src/test/regress/sql/indexing.sql | 33 ++++
3 files changed, 302 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b16a1180d8..5aca295d29 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -645,6 +645,7 @@ static void ATDetachCheckNoForeignKeyRefs(Relation partition);
static char GetAttributeCompression(Oid atttypid, char *compression);
static char GetAttributeStorage(Oid atttypid, const char *storagemode);
+static bool HasGlobalChildIndex(Relation idxRel);
/* ----------------------------------------------------------------
* DefineRelation
@@ -1217,6 +1218,11 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
idxstmt =
generateClonedIndexStmt(NULL, idxRel,
attmap, &constraintOid);
+ if (HasGlobalChildIndex(idxRel))
+ {
+ elog(DEBUG2, "create global index for the new child partition table");
+ idxstmt->global_index = true;
+ }
DefineIndex(RelationGetRelid(rel),
idxstmt,
InvalidOid,
@@ -17992,6 +17998,7 @@ AttachPartitionEnsureIndexes(Relation rel, Relation attachrel)
IndexInfo **attachInfos;
int i;
ListCell *cell;
+ ListCell *cell2;
MemoryContext cxt;
MemoryContext oldcxt;
@@ -18130,15 +18137,185 @@ AttachPartitionEnsureIndexes(Relation rel, Relation attachrel)
{
IndexStmt *stmt;
Oid conOid;
+ bool isGlobal = false;
+ isGlobal = HasGlobalChildIndex(idxRel);
stmt = generateClonedIndexStmt(NULL,
idxRel, attmap,
&conOid);
+
+ /*
+ * Perform cross partition uniqueness check if it is a global
+ * unique index
+ */
+ if (isGlobal && idxRel->rd_index->indisunique)
+ {
+ PartitionDesc partdesc;
+ Relation hRel;
+ Relation iRel;
+ int j = 0;
+ int nparts;
+ Oid *part_oids;
+
+ List *childIndexList = find_inheritance_children(idx, ShareLock);
+
+ partdesc = RelationGetPartitionDesc(rel, true);
+ nparts = partdesc->nparts;
+ part_oids = palloc(sizeof(Oid) * nparts);
+
+ memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
+ for (j = 0; j < nparts; j++)
+ {
+ Oid childRelid = part_oids[j];
+ List *childidxs;
+
+ if (childRelid == RelationGetRelid(attachrel))
+ {
+ elog(DEBUG2, "skip the partition-to-be from building global spool: %d", childRelid);
+ continue;
+ }
+ hRel = table_open(childRelid, AccessShareLock);
+
+ childidxs = RelationGetIndexList(hRel);
+ foreach(cell2, childidxs)
+ {
+ Oid cldidxid = lfirst_oid(cell2);
+
+ /*
+ * Only consider a child index that is directly inherited
+ * from the parent index OID.
+ */
+ if (list_member_oid(childIndexList, cldidxid))
+ {
+ iRel = index_open(cldidxid, AccessShareLock);
+ elog(DEBUG2, "found a matching child index OID to build global spool %d", cldidxid);
+
+ /*
+ * We need to construct a global spool structure
+ * in nbtsort.c in order to determine global
+ * uniqueness. Marking partitions now
+ */
+ if (j == 0)
+ {
+ elog(DEBUG2, "mark as first partition to build global spool");
+ stmt->globalIndexPart = -1;
+ }
+ else
+ stmt->globalIndexPart = 0;
+
+ stmt->global_index = true;
+ stmt->nparts = nparts;
+
+ PopulateGlobalSpool(iRel, hRel, stmt);
+ index_close(iRel, NoLock);
+ break;
+ }
+ }
+ table_close(hRel, NoLock);
+ }
+ elog(DEBUG2, "mark as the last partition to utilize global spool");
+ stmt->globalIndexPart = 1;
+ }
+ else
+ {
+ elog(DEBUG2, "partitioned index %u is not a global unique index, build it now...",
+ RelationGetRelid(idxRel));
+ }
+
DefineIndex(RelationGetRelid(attachrel), stmt, InvalidOid,
RelationGetRelid(idxRel),
conOid,
true, false, false, false, false);
}
+ else
+ {
+ IndexStmt *stmt;
+ Oid conOid;
+ bool isGlobal = false;
+
+ stmt = generateClonedIndexStmt(NULL,
+ idxRel, attmap,
+ &conOid);
+ isGlobal = HasGlobalChildIndex(idxRel);
+ if (isGlobal && idxRel->rd_index->indisunique)
+ {
+ PartitionDesc partdesc;
+ Relation hRel;
+ Relation iRel;
+ int j = 0;
+ int nparts;
+ Oid *part_oids;
+
+ List *childIndexList = find_inheritance_children(idx, ShareLock);
+
+ partdesc = RelationGetPartitionDesc(rel, true);
+ nparts = partdesc->nparts;
+ part_oids = palloc(sizeof(Oid) * nparts);
+
+ memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
+ for (j = 0; j < nparts; j++)
+ {
+ Oid childRelid = part_oids[j];
+ List *childidxs;
+
+ hRel = table_open(childRelid, AccessShareLock);
+
+ childidxs = RelationGetIndexList(hRel);
+ foreach(cell2, childidxs)
+ {
+ Oid cldidxid = lfirst_oid(cell2);
+
+ /*
+ * Only consider a child index that is directly inherited
+ * from the parent index OID.
+ */
+ if (list_member_oid(childIndexList, cldidxid))
+ {
+ iRel = index_open(cldidxid, AccessShareLock);
+ elog(DEBUG2, "found a matching child index OID type %c to build global spool %d",
+ iRel->rd_rel->relkind, cldidxid);
+
+ /*
+ * change partition-to-be's duplicate unique index
+ * relkind to RELKIND_GLOBAL_INDEX
+ */
+ if (iRel->rd_rel->relkind != RELKIND_GLOBAL_INDEX)
+ {
+ elog(DEBUG2, "Update index relation %d to have relkind = RELKIND_GLOBAL_INDEX",
+ RelationGetRelid(iRel));
+ ChangeRelKind(iRel, RELKIND_GLOBAL_INDEX);
+ }
+
+ /*
+ * We need to construct a global spool structure
+ * in nbtsort.c in order to determine global
+ * uniqueness. Marking partitions now
+ */
+ if (j == 0)
+ {
+ elog(DEBUG2, "mark as first partition to build global spool");
+ stmt->globalIndexPart = -1;
+ }
+ else if (j == nparts - 1)
+ {
+ elog(DEBUG2, "mark as last partition to build global spool");
+ stmt->globalIndexPart = 1;
+ }
+ else
+ stmt->globalIndexPart = 0;
+
+ stmt->global_index = true;
+ stmt->nparts = nparts;
+
+ PopulateGlobalSpool(iRel, hRel, stmt);
+ index_close(iRel, NoLock);
+ break;
+ }
+ }
+ table_close(hRel, NoLock);
+ }
+ }
+ }
index_close(idxRel, AccessShareLock);
}
@@ -18636,6 +18813,15 @@ DetachPartitionFinalize(Relation rel, Relation partRel, bool concurrent,
if (OidIsValid(constrOid))
ConstraintSetParentConstraint(constrOid, InvalidOid, InvalidOid);
+ /*
+ * If the index is a global index, change its relkind back to that of
+ * a regular index after detach.
+ */
+ if (idx->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
+ {
+ elog(DEBUG2, "found a global index, transform it to RELKIND_INDEX at detach...");
+ ChangeRelKind(idx, RELKIND_INDEX);
+ }
index_close(idx, NoLock);
}
@@ -19357,3 +19543,31 @@ GetAttributeStorage(Oid atttypid, const char *storagemode)
return cstorage;
}
+
+static bool
+HasGlobalChildIndex(Relation idxRel)
+{
+ /* find out if current index contains child indexes that are global */
+ List *childIndexOidList;
+ ListCell *cell;
+ bool isGlobal = false;
+
+ childIndexOidList = find_all_inheritors(RelationGetRelid(idxRel),
+ AccessExclusiveLock, NULL);
+
+ foreach(cell, childIndexOidList)
+ {
+ Oid childIndexOid = lfirst_oid(cell);
+ Relation idxChildRel;
+
+ idxChildRel = index_open(childIndexOid, AccessShareLock);
+ if (idxChildRel->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
+ {
+ isGlobal = true;
+ index_close(idxChildRel, NoLock);
+ break;
+ }
+ index_close(idxChildRel, NoLock);
+ }
+ return isGlobal;
+}
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 30a3ce0f20..58de14c037 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1473,3 +1473,58 @@ DETAIL: Key (b)=(572814) is duplicated.
delete from gidxpart where a = 150000 and b = 572814;
create unique index on gidxpart (b) global;
drop table gidxpart;
+-- Test partition attach and detach with global unique index (no existing index)
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (100000);
+insert into gidxpart (a, b, c) values (42, 572814, 'inserted first on gidxpart1');
+create unique index on gidxpart (b) global;
+create table gidxpart2 (a int, b int, c text);
+insert into gidxpart2 (a, b, c) values (150000, 572814, 'dup inserted on gidxpart2');
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999); -- should fail
+ERROR: could not create unique index "gidxpart1_b_idx"
+DETAIL: Key (b)=(572814) is duplicated.
+update gidxpart2 set b = 5000;
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999);
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be g
+ relname | relkind
+-----------------+---------
+ gidxpart2_b_idx | g
+(1 row)
+
+alter table gidxpart detach partition gidxpart2;
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be i
+ relname | relkind
+-----------------+---------
+ gidxpart2_b_idx | i
+(1 row)
+
+drop table gidxpart;
+drop table gidxpart2;
+-- Test partition attach and detach with global unique index (with duplicate index)
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (100000);
+insert into gidxpart (a, b, c) values (42, 572814, 'inserted first on gidxpart1');
+create unique index on gidxpart (b) global;
+create table gidxpart2 (a int, b int, c text);
+create unique index on gidxpart2 (b);
+insert into gidxpart2 (a, b, c) values (150000, 572814, 'dup inserted on gidxpart2');
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be i
+ relname | relkind
+-----------------+---------
+ gidxpart2_b_idx | i
+(1 row)
+
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999); -- should fail
+ERROR: could not create unique index "gidxpart1_b_idx"
+DETAIL: Key (b)=(572814) is duplicated.
+update gidxpart2 set b = 5000;
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999);
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be g
+ relname | relkind
+-----------------+---------
+ gidxpart2_b_idx | g
+(1 row)
+
+alter table gidxpart detach partition gidxpart2;
+drop table gidxpart;
+drop table gidxpart2;
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index 84dde4df93..78649bb5ca 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -783,3 +783,36 @@ delete from gidxpart where a = 150000 and b = 572814;
create unique index on gidxpart (b) global;
drop table gidxpart;
+-- Test partition attach and detach with global unique index (no existing index)
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (100000);
+insert into gidxpart (a, b, c) values (42, 572814, 'inserted first on gidxpart1');
+create unique index on gidxpart (b) global;
+create table gidxpart2 (a int, b int, c text);
+insert into gidxpart2 (a, b, c) values (150000, 572814, 'dup inserted on gidxpart2');
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999); -- should fail
+update gidxpart2 set b = 5000;
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999);
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be g
+alter table gidxpart detach partition gidxpart2;
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be i
+drop table gidxpart;
+drop table gidxpart2;
+
+-- Test partition attach and detach with global unique index (with duplicate index)
+create table gidxpart (a int, b int, c text) partition by range (a);
+create table gidxpart1 partition of gidxpart for values from (0) to (100000);
+insert into gidxpart (a, b, c) values (42, 572814, 'inserted first on gidxpart1');
+create unique index on gidxpart (b) global;
+create table gidxpart2 (a int, b int, c text);
+create unique index on gidxpart2 (b);
+insert into gidxpart2 (a, b, c) values (150000, 572814, 'dup inserted on gidxpart2');
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be i
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999); -- should fail
+update gidxpart2 set b = 5000;
+alter table gidxpart attach partition gidxpart2 for values from (100000) to (199999);
+select relname, relkind from pg_class where relname = 'gidxpart2_b_idx'; -- should be g
+alter table gidxpart detach partition gidxpart2;
+drop table gidxpart;
+drop table gidxpart2;
+
--
2.17.1
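For reviewers: the 0004 patch below adds INSERT/UPDATE-time checking, but its regression tests are truncated in this message. Based on its commit message and the nbtinsert.c/nbtree.c hooks it adds, the intended behavior can be sketched as follows. This is a hypothetical session reusing the table and column names from the tests in the earlier patches, not the patch's own regression output:

```sql
-- Sketch of insert-time enforcement with a global unique index on a
-- non-partition key (b). The second insert targets a different
-- partition than the first, yet is expected to be rejected because
-- the global unique index spans all partitions.
create table gidxpart (a int, b int, c text) partition by range (a);
create table gidxpart1 partition of gidxpart for values from (0) to (100000);
create table gidxpart2 partition of gidxpart for values from (100000) to (199999);
create unique index on gidxpart (b) global;
insert into gidxpart values (42, 572814, 'first');       -- routed to gidxpart1
insert into gidxpart values (150000, 572814, 'second');  -- expected to fail:
                                                         -- duplicate key (b) in another partition
```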
Attachment: 0004-support-global-unique-index-insert-and-update.patch
From 8be88eaebd1beeb98ebaa49b3053009c6a0c6ed0 Mon Sep 17 00:00:00 2001
From: David Zhang <david.zhang@highgo.ca>
Date: Thu, 17 Nov 2022 12:26:25 -0800
Subject: [PATCH 4/4] support global unique index insert and update
---
src/backend/access/nbtree/nbtinsert.c | 30 +++++-
src/backend/access/nbtree/nbtree.c | 123 ++++++++++++++++++++++++-
src/include/access/nbtree.h | 5 +
src/test/regress/expected/indexing.out | 41 +++++++++
src/test/regress/sql/indexing.sql | 20 ++++
5 files changed, 213 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..ad95726ea9 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -34,7 +34,7 @@ static BTStack _bt_search_insert(Relation rel, BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
- uint32 *speculativeToken);
+ uint32 *speculativeToken, Relation origHeapRel);
static OffsetNumber _bt_findinsertloc(Relation rel,
BTInsertState insertstate,
bool checkingunique,
@@ -73,6 +73,11 @@ static BlockNumber *_bt_deadblocks(Page page, OffsetNumber *deletable,
int *nblocks);
static inline int _bt_blk_cmp(const void *arg1, const void *arg2);
+TransactionId _bt_check_unique_gi(Relation rel, BTInsertState insertstate,
+ Relation heapRel,
+ IndexUniqueCheck checkUnique, bool *is_unique,
+ uint32 *speculativeToken, Relation origHeapRel);
+
/*
* _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
*
@@ -206,7 +211,7 @@ search:
uint32 speculativeToken;
xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
- &is_unique, &speculativeToken);
+ &is_unique, &speculativeToken, NULL);
if (unlikely(TransactionIdIsValid(xwait)))
{
@@ -379,6 +384,15 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
NULL);
}
+TransactionId
+_bt_check_unique_gi(Relation rel, BTInsertState insertstate, Relation heapRel,
+ IndexUniqueCheck checkUnique, bool *is_unique,
+ uint32 *speculativeToken, Relation origHeapRel)
+{
+ return _bt_check_unique(rel, insertstate, heapRel, checkUnique,
+ is_unique, speculativeToken, origHeapRel);
+}
+
/*
* _bt_check_unique() -- Check for violation of unique index constraint
*
@@ -405,7 +419,7 @@ _bt_search_insert(Relation rel, BTInsertState insertstate)
static TransactionId
_bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
IndexUniqueCheck checkUnique, bool *is_unique,
- uint32 *speculativeToken)
+ uint32 *speculativeToken, Relation origHeapRel)
{
IndexTuple itup = insertstate->itup;
IndexTuple curitup = NULL;
@@ -560,6 +574,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
&all_dead))
{
TransactionId xwait;
+ bool idx_fetch_result;
/*
* It is a duplicate. If we are only doing a partial
@@ -613,8 +628,13 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
* entry.
*/
htid = itup->t_tid;
- if (table_index_fetch_tuple_check(heapRel, &htid,
- SnapshotSelf, NULL))
+ if (origHeapRel)
+ idx_fetch_result = table_index_fetch_tuple_check(origHeapRel, &htid,
+ SnapshotSelf, NULL);
+ else
+ idx_fetch_result = table_index_fetch_tuple_check(heapRel, &htid,
+ SnapshotSelf, NULL);
+ if (idx_fetch_result)
{
/* Normal case --- it's still live */
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38..84dc58ba38 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
#include "access/relscan.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
+#include "access/table.h"
+#include "catalog/partition.h"
#include "commands/progress.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
@@ -34,9 +36,11 @@
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/smgr.h"
+#include "storage/predicate.h"
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
+#include "partitioning/partdesc.h"
/*
@@ -86,7 +90,9 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
int *nremaining);
-
+static void
+ btinsert_check_unique_gi(IndexTuple itup, Relation idxRel,
+ Relation heapRel, IndexUniqueCheck checkUnique);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -177,6 +183,118 @@ btbuildempty(Relation index)
smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
}
+/*
+ * btinsert_check_unique_gi() -- cross-partition uniqueness check.
+ *
+ * Loop over all partitions carrying the global index and check uniqueness in each.
+ */
+static void
+btinsert_check_unique_gi(IndexTuple itup, Relation idxRel,
+ Relation heapRel, IndexUniqueCheck checkUnique)
+{
+ bool is_unique = false;
+ BTScanInsert itup_key = _bt_mkscankey(idxRel, itup);
+
+ if (!itup_key->anynullkeys &&
+ idxRel->rd_rel->relkind == RELKIND_GLOBAL_INDEX)
+ {
+ Oid parentId;
+ Relation parentTbl;
+ PartitionDesc partDesc;
+ int i;
+ int nparts;
+ Oid *partOids;
+
+ itup_key->scantid = NULL;
+ parentId = heapRel->rd_rel->relispartition ?
+ get_partition_parent(idxRel->rd_index->indrelid, false) : InvalidOid;
+ parentTbl = table_open(parentId, AccessShareLock);
+ partDesc = RelationGetPartitionDesc(parentTbl, true);
+ nparts = partDesc->nparts;
+ partOids = palloc(sizeof(Oid) * nparts);
+ memcpy(partOids, partDesc->oids, sizeof(Oid) * nparts);
+ for (i = 0; i < nparts; i++)
+ {
+ Oid childRelid = partOids[i];
+ List *childidxs;
+ ListCell *cell;
+
+ if (childRelid != heapRel->rd_rel->oid)
+ {
+ Relation hRel = table_open(childRelid, AccessShareLock);
+
+ childidxs = RelationGetIndexList(hRel);
+ foreach(cell, childidxs)
+ {
+ Oid cldidxid = lfirst_oid(cell);
+ Relation iRel = index_open(cldidxid, AccessShareLock);
+
+ if (iRel->rd_rel->relkind == RELKIND_GLOBAL_INDEX
+ && iRel->rd_rel->oid != idxRel->rd_rel->oid)
+ {
+ BTStack stack;
+ uint32 speculativeToken;
+ BTInsertStateData insertstate;
+ TransactionId xwait = InvalidTransactionId;
+
+ insertstate.itup = itup;
+ insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
+ insertstate.itup_key = itup_key;
+ insertstate.bounds_valid = false;
+ insertstate.buf = InvalidBuffer;
+ insertstate.postingoff = 0;
+
+ search_global:
+ stack = _bt_search(iRel, insertstate.itup_key,
+ &insertstate.buf, BT_READ, NULL);
+ xwait = _bt_check_unique_gi(iRel, &insertstate,
+ hRel, checkUnique, &is_unique,
+ &speculativeToken, heapRel);
+ if (unlikely(TransactionIdIsValid(xwait)))
+ {
+ /* Have to wait for the other guy ... */
+ if (insertstate.buf)
+ {
+ _bt_relbuf(iRel, insertstate.buf);
+ insertstate.buf = InvalidBuffer;
+ }
+
+ /*
+ * If it's a speculative insertion, wait for it to
+ * finish (ie. to go ahead with the insertion, or
+ * kill the tuple). Otherwise wait for the
+ * transaction to finish as usual.
+ */
+ if (speculativeToken)
+ SpeculativeInsertionWait(xwait, speculativeToken);
+ else
+ XactLockTableWait(xwait, iRel, &itup->t_tid, XLTW_InsertIndex);
+
+ /* start over... */
+ if (stack)
+ _bt_freestack(stack);
+ goto search_global;
+ }
+ if (insertstate.buf)
+ _bt_relbuf(iRel, insertstate.buf);
+ if (stack)
+ _bt_freestack(stack);
+ }
+ index_close(iRel, AccessShareLock);
+ }
+ if (childidxs)
+ list_free(childidxs);
+ table_close(hRel, AccessShareLock);
+ }
+ }
+ if (partOids)
+ pfree(partOids);
+ table_close(parentTbl, AccessShareLock);
+ }
+ if (itup_key)
+ pfree(itup_key);
+}
+
/*
* btinsert() -- insert an index tuple into a btree.
*
@@ -199,6 +317,9 @@ btinsert(Relation rel, Datum *values, bool *isnull,
result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
+ if (checkUnique != UNIQUE_CHECK_NO)
+ btinsert_check_unique_gi(itup, rel, heapRel, checkUnique);
+
pfree(itup);
return result;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8e4f6864e5..19761a4c31 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1284,4 +1284,9 @@ extern IndexBuildResult *btbuild(Relation heap, Relation index,
struct IndexInfo *indexInfo);
extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern TransactionId _bt_check_unique_gi(Relation rel, BTInsertState insertstate,
+ Relation heapRel,
+ IndexUniqueCheck checkUnique, bool *is_unique,
+ uint32 *speculativeToken, Relation origHeapRel);
+
#endif /* NBTREE_H */
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 58de14c037..f0102fabe6 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1459,6 +1459,47 @@ unique, btree, for table "public.gidxpart"
Partitions: gidxpart1_b_idx,
gidxpart2_b_idx
+-- cross-partition uniqueness check for insert and update
+insert into gidxpart values (1, 1, 'first');
+insert into gidxpart values (11, 11, 'eleventh');
+insert into gidxpart values (2, 11, 'duplicated (b)=(11) on other partition');
+ERROR: duplicate key value violates unique constraint "gidxpart2_b_idx"
+DETAIL: Key (b)=(11) already exists.
+insert into gidxpart values (12, 1, 'duplicated (b)=(1) on other partition');
+ERROR: duplicate key value violates unique constraint "gidxpart1_b_idx"
+DETAIL: Key (b)=(1) already exists.
+insert into gidxpart values (2, 120, 'second');
+insert into gidxpart values (12, 2, 'twelfth');
+update gidxpart set b=2 where a=2;
+ERROR: duplicate key value violates unique constraint "gidxpart2_b_idx"
+DETAIL: Key (b)=(2) already exists.
+update gidxpart set b=1 where a=12;
+ERROR: duplicate key value violates unique constraint "gidxpart1_b_idx"
+DETAIL: Key (b)=(1) already exists.
+update gidxpart set b=12 where a=12;
+update gidxpart set b=2 where a=2;
+select * from gidxpart;
+ a | b | c
+----+----+----------
+ 1 | 1 | first
+ 2 | 2 | second
+ 11 | 11 | eleventh
+ 12 | 12 | twelfth
+(4 rows)
+
+-- cross-partition uniqueness check applies to newly created partition
+create table gidxpart3 partition of gidxpart for values from (100) to (200);
+select relname, relkind from pg_class where relname = 'gidxpart3_b_idx';
+ relname | relkind
+-----------------+---------
+ gidxpart3_b_idx | g
+(1 row)
+
+insert into gidxpart values (150, 11, 'duplicated (b)=(11) on other partition');
+ERROR: duplicate key value violates unique constraint "gidxpart2_b_idx"
+DETAIL: Key (b)=(11) already exists.
+insert into gidxpart values (150, 13, 'no duplicate b');
+-- clean up global index tests
drop index gidx_u;
drop table gidxpart;
-- Test the cross-partition uniqueness with non-partition key with global unique index
diff --git a/src/test/regress/sql/indexing.sql b/src/test/regress/sql/indexing.sql
index 78649bb5ca..42ee1ce19f 100644
--- a/src/test/regress/sql/indexing.sql
+++ b/src/test/regress/sql/indexing.sql
@@ -769,6 +769,26 @@ create unique index gidx_u on gidxpart using btree(b) global;
select relname, relhasindex, relkind from pg_class where relname like '%gidx%' order by oid;
\d+ gidxpart
\d+ gidx_u
+-- cross-partition uniqueness check for insert and update
+insert into gidxpart values (1, 1, 'first');
+insert into gidxpart values (11, 11, 'eleventh');
+insert into gidxpart values (2, 11, 'duplicated (b)=(11) on other partition');
+insert into gidxpart values (12, 1, 'duplicated (b)=(1) on other partition');
+insert into gidxpart values (2, 120, 'second');
+insert into gidxpart values (12, 2, 'twelfth');
+update gidxpart set b=2 where a=2;
+update gidxpart set b=1 where a=12;
+update gidxpart set b=12 where a=12;
+update gidxpart set b=2 where a=2;
+select * from gidxpart;
+
+-- cross-partition uniqueness check applies to newly created partition
+create table gidxpart3 partition of gidxpart for values from (100) to (200);
+select relname, relkind from pg_class where relname = 'gidxpart3_b_idx';
+insert into gidxpart values (150, 11, 'duplicated (b)=(11) on other partition');
+insert into gidxpart values (150, 13, 'no duplicate b');
+
+-- clean up global index tests
drop index gidx_u;
drop table gidxpart;
--
2.17.1
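For readers skimming the patch, the INSERT-time cross-partition check added by btinsert_check_unique_gi() can be summarized with this simplified model. This is illustrative Python, not the C implementation: dicts stand in for the per-partition btree indexes, and the names are invented for the sketch.

```python
# Toy model of the cross-partition uniqueness check: before accepting an
# insert, probe every sibling partition's unique index for the same key.
# (The real code walks each sibling's global index with _bt_search /
# _bt_check_unique_gi; dicts stand in for those indexes here.)

class UniqueViolation(Exception):
    pass

def insert_with_global_check(partitions, target, key, tid):
    """Insert key -> tid into `target`, probing all other partitions first."""
    for name, index in partitions.items():
        if name != target and key in index:
            raise UniqueViolation(
                f'duplicate key value violates unique constraint on "{name}"')
    partitions[target][key] = tid  # local insert (done by _bt_doinsert in C)

parts = {"gidxpart1": {}, "gidxpart2": {}}
insert_with_global_check(parts, "gidxpart1", 1, "tid-1")
try:
    insert_with_global_check(parts, "gidxpart2", 1, "tid-2")
except UniqueViolation as exc:
    print(exc)
```

Note that in the patch the local-partition uniqueness check still happens inside _bt_doinsert(); only the sibling-partition probes are new.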
Hello
Do we need new syntax actually? I think that a global unique index can be created automatically instead of raising an error "unique constraint on partitioned table must include all partitioning columns"
regards, Sergei
On Fri, 18 Nov 2022 at 10:04, Sergei Kornilov <sk@zsrv.org> wrote:
Hello
Do we need new syntax actually? I think that a global unique index can be
created automatically instead of raising an error "unique constraint on
partitioned table must include all partitioning columns"
+1
Pavel
Sergei Kornilov <sk@zsrv.org> writes:
Do we need new syntax actually? I think that a global unique index can be created automatically instead of raising an error "unique constraint on partitioned table must include all partitioning columns"
I'm not convinced that we want this feature at all: as far as I can see,
it will completely destroy the benefits of making a partitioned table
in the first place. But if we do want it, I don't think it should be
so easy to create a global index by accident as that syntax approach
would make it. I think there needs to be a pretty clear YES I WANT TO
SHOOT MYSELF IN THE FOOT clause in the command.
regards, tom lane
On Fri, 18 Nov 2022 at 16:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sergei Kornilov <sk@zsrv.org> writes:
Do we need new syntax actually? I think that a global unique index can
be created automatically instead of raising an error "unique constraint on
partitioned table must include all partitioning columns"

I'm not convinced that we want this feature at all: as far as I can see,
it will completely destroy the benefits of making a partitioned table
in the first place. But if we do want it, I don't think it should be
so easy to create a global index by accident as that syntax approach
would make it. I think there needs to be a pretty clear YES I WANT TO
SHOOT MYSELF IN THE FOOT clause in the command.
Isn't it possible to have a partitioned index?
https://www.highgo.ca/2022/10/14/global-index-a-different-approach/
Regards
Pavel
On Thu, 17 Nov 2022 at 22:01, Cary Huang <cary.huang@highgo.ca> wrote:
Patch: Global Unique Index
Let me start by expressing severe doubt on the usefulness of such a
feature, but also salute your efforts to contribute.
In other words, a global unique index and a regular partitioned index are essentially the same in terms of their storage structure except that one can do cross-partition uniqueness check, the other cannot.
This is the only workable architecture, since it allows DETACH to be
feasible, which is essential.
You don't seem to mention that this would require a uniqueness check
on each partition. Is that correct? This would result in O(N) cost of
uniqueness checks, severely limiting load speed. I notice you don't
offer any benchmarks on load speed or the overhead associated with
this, which is not good if you want to be taken seriously, but at
least it is recoverable.
(It might be necessary to specify some partitions as READ ONLY, to
allow us to record their min/max values for the indexed cols, allowing
us to do this more quickly.)
- Supported Features -
1. Global unique index is supported only on btree index type
Why? Surely any index type that supports uniqueness is good.
- Not-supported Features -
1. Global uniqueness check with sub-partitioned tables is not yet supported, as we do not have an immediate use case and it may involve major changes to the current implementation
Hmm, sounds like a problem. Arranging the calls recursively should work.
- Create a global unique index -
To create a regular unique index on a partitioned table, Postgres has to perform a heap scan and sort on every child partition. The uniqueness check happens during the sorting phase and raises an error if multiple tuples with the same index key are sorted together. To achieve the global uniqueness check, we make Postgres perform the sorting after all of the child partitions have been scanned, instead of in the "sort per partition" fashion. In other words, the sorting happens only once at the very end, and it sorts the tuples coming from all the partitions and therefore can ensure global uniqueness.
My feeling is that performance on this will suck so badly that we must
warn people away from it, and tell people if they want this, create
the index at the start and let it load.
Hopefully CREATE INDEX CONCURRENTLY still works.
Let's see some benchmarks on this also please.
You'll need to think about progress reporting early because correctly
reporting the progress and expected run times are likely critical for
usability.
Example:
CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);
CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10);
CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20);
CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL;
INSERT INTO gidx_part values(5, 5, 'test');
INSERT INTO gidx_part values(15, 5, 'test');
ERROR: duplicate key value violates unique constraint "gidx_part1_b_idx"
DETAIL: Key (b)=(5) already exists.
Well done.
- DETACH -
Since we retain the same partitioned structure, detaching a partition with a global unique index is straightforward. Upon DETACH, Postgres will change its relkind from RELKIND_GLOBAL_INDEX to RELKIND_INDEX and remove the inheritance relationship as usual.
It's the only way that works
- Optimizer, query planning and vacuum -
Since no major modification is done to the global unique index's structure and storage, it works in the same way as a regular partitioned index. No major changes are required to the optimizer, planner, and vacuum process, as they should work in the same way as with a regular index.
Agreed
Making a prototype is a great first step.
The next step is to understand the good and the bad aspects of it, so
you can see what else needs to be done. You need to be honest and real
about the fact that this may not actually be desirable in practice, or
in a restricted use case.
That means performance analysis of create, load, attach, detach,
INSERT, SELECT, UPD/DEL and anything else that might be affected,
together with algorithmic analysis of what happens for larger N and
larger tables.
Expect many versions; take provisions for many days.
Best of luck
--
Simon Riggs http://www.EnterpriseDB.com/
Hi Simon
Thank you so much for sharing these valuable comments and concerns about our work. We understand there are a lot of TODOs left before we can move this forward in a serious manner. Your comments have been very helpful and we are very grateful.
You don't seem to mention that this would require a uniqueness check
on each partition. Is that correct? This would result in O(N) cost of
uniqueness checks, severely limiting load speed. I notice you don't
offer any benchmarks on load speed or the overhead associated with
this, which is not good if you want to be taken seriously, but at
least it is recoverable.
Yes, during INSERT and UPDATE, the uniqueness check happens on every partition, including the current one. This introduces extra look-up costs and limits the speed significantly, especially when there is a large number of partitions. This is one drawback of the global unique index that needs to be optimized / improved.
In fact, all other operations, such as CREATE and ATTACH, that involve the global uniqueness check will have a certain degree of performance loss as well. See the benchmark figures below.
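The O(N) behaviour described above can be made concrete with a toy cost model. This is an illustrative sketch, not a benchmark; the function name and numbers are invented for the example.

```python
# Each INSERT under a global unique index probes every partition's index
# once, so index probes for a bulk load scale as rows * partitions.
def index_probes(n_partitions, rows):
    return rows * n_partitions

for n in (1, 6, 100):
    print(f"{n:>3} partitions: {index_probes(n, 1_000_000):,} probes")
```

With 6 partitions, a 6-million-row load performs 36 million probes instead of 6 million, which lines up with the roughly 6x insert slowdown measured below.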
(It might be necessary to specify some partitions as READ ONLY, to
allow us to record their min/max values for the indexed cols, allowing
us to do this more quickly.)
Thank you so much for this great suggestion. If there were an indication that some partitions have become READ ONLY, and we recorded the min/max values of their globally indexed columns, then we might be able to skip those partitions during the uniqueness check whenever the value falls outside their (min, max) range. Did we understand it correctly? Could you help elaborate more?
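If we understood the suggestion correctly, the skip test might look roughly like this. This is a purely hypothetical sketch; no such metadata or function exists in the patch.

```python
# Hypothetical pruning rule: a READ ONLY partition with recorded (min, max)
# for the indexed column only needs to be probed when the key falls inside
# that range; writable partitions must always be probed.
def must_probe(meta, key):
    if meta.get("read_only"):
        lo, hi = meta["minmax"]
        return lo <= key <= hi
    return True

readonly = {"read_only": True, "minmax": (100, 200)}
writable = {"read_only": False}
print(must_probe(readonly, 50), must_probe(readonly, 150), must_probe(writable, 50))
```

For a key of 50, the read-only partition covering (100, 200) can be skipped entirely, while the writable partition still has to be probed.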
1. Global unique index is supported only on btree index type
Why? Surely any index type that supports uniqueness is good.
Yes, we can definitely have the same support for other index types that support UNIQUE.
- Not-supported Features -
1. Global uniqueness check with Sub partition tables is not yet supported as we do not have immediate use case and it may involve major change in current implementation
Hmm, sounds like a problem. Arranging the calls recursively should work.
Yes, it is a matter of rearranging the recursive calls to correctly find all leaf partitions to be considered for the global uniqueness check. So far, only the partitions in the first layer are considered.
My feeling is that performance on this will suck so badly that we must
warn people away from it, and tell people if they want this, create
the index at the start and let it load.
Yes, to support a global unique index, extra logic needs to run to ensure uniqueness, especially during INSERT and ATTACH, where it has to look up all involved partitions. We have attached benchmark figures below.
This is also the reason the "GLOBAL" syntax is required, so people know they really want this feature. To help users better understand the potential performance drawbacks, should we add a warning in the documentation?
Hopefully CREATE INDEX CONCURRENTLY still works.
Yes, we verified this global unique index approach on Postgres 14.5 with a community CREATE INDEX CONCURRENTLY patch for partitioned tables.
Let's see some benchmarks on this also please.
Here is a simple timing comparison between a regular and a global unique index on a partitioned table with 6 partitions.
global unique index:
-> 156,285ms to insert 6 million records (1 million on each partition)
-> 6,592ms to delete all 6 million records
-> 3,957ms to create global unique index with 6 million records pre-inserted
-> 3,650ms to attach a new partition with 1 million records pre-inserted
-> 17ms to detach a partition with 1 million records in it
regular unique index:
-> 26,007ms to insert 6 million records (1 million on each partition)
-> 7,238ms to delete all 6 million records
-> 2,933ms to create regular unique index with 6 million records pre-inserted
-> 628ms to attach a new partition with 1 million records pre-inserted
-> 17ms to detach a partition with 1 million records in it
These are the commands I use to get the numbers (switch create unique index clause between global and regular):
-> \timing on
-> create table test(a int, b int, c text) partition by range (a);
-> create table test1 partition of test for values from (MINVALUE) to (1000000);
-> create table test2 partition of test for values from (1000000) to (2000000);
-> create table test3 partition of test for values from (2000000) to (3000000);
-> create table test4 partition of test for values from (3000000) to (4000000);
-> create table test5 partition of test for values from (4000000) to (5000000);
-> create table test6 partition of test for values from (5000000) to (6000000);
-> create unique index myindex on test(b) global;
-> insert into test values(generate_series(0,5999999), generate_series(0,5999999), 'test'); /* record timing */
-> delete from test; /* record timing */
-> drop index myindex;
-> insert into test values(generate_series(0,5999999), generate_series(0,5999999), 'test');
-> create unique index myindex on test(b) global; /* record timing */
-> create table test7 (a int, b int, c text);
-> insert into test7 values(generate_series(6000000, 6999999), generate_series(6000000, 6999999), 'test');
-> alter table test attach partition test7 for values from (6000000) TO (7000000); /* record timing */
-> alter table test detach partition test7; /* record timing */
As you can see, the insert operation suffers the largest performance penalty. In fact, it takes roughly 6 times as long to complete the insertion, which matches the number of partitions in the test.
The ATTACH operation also takes roughly 6 times as long, because it has to perform uniqueness checks against all 6 existing partitions to determine global uniqueness. DETACH takes the same time in both cases.
Creating the global unique index takes about 35% longer.
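The ratios quoted above can be checked directly against the raw timings (in ms) from the benchmark:

```python
# Sanity-check the slowdown factors from the timing figures above.
insert_ratio = 156285 / 26007          # global vs regular insert
attach_ratio = 3650 / 628              # global vs regular attach
create_extra = (3957 - 2933) / 2933    # extra time to build the global index

print(round(insert_ratio, 1))   # ~6.0, matching the 6 partitions
print(round(attach_ratio, 1))   # ~5.8
print(f"{create_extra:.0%}")    # ~35%
```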
We also ran some tests for random SELECT and UPDATE on a non-partition key with pgbench, comparing performance among 3 configurations: no index, regular unique index (with the partition key included), and global unique index:
Test 1: scale=100, 10 partitions, 1 million tuples/partition
SELECT:
-> No partitioned index: tps = 3.827886
-> regular unique index: tps = 14.713099
-> global unique index: tps = 23791.314238
UPDATE mixed with SELECT:
-> No partitioned index: tps = 1.926013
-> regular unique index: tps = 7.087059
-> global unique index: tps = 2253.098335
Test 2: scale=1,000, 100 partitions, 1 million tuples/partition
SELECT:
-> No partitioned index: tps = 0.110029
-> regular unique index: tps = 0.268199
-> global unique index: tps = 2334.811682
UPDATE mixed with SELECT:
-> No partitioned index: tps = 0.115329
-> regular unique index: tps = 0.197052
-> global unique index: tps = 541.488621
Test 3: scale=10,000, 1,000 partitions, 1 million tuples/partition
SELECT:
-> No partitioned index: tps = 0.011047
-> regular unique index: tps = 0.036812
-> global unique index: tps = 147.189456
UPDATE mixed with SELECT:
-> No partitioned index: tps = 0.008209
-> regular unique index: tps = 0.054367
-> global unique index: tps = 57.740432
Thank you very much, and we hope this information helps clarify some concerns about this approach.
David and Cary
============================
HighGo Software Canada
---- On Mon, 21 Nov 2022 05:33:30 -0700 Simon Riggs wrote ---
Tom Lane wrote on 18.11.2022 at 16:06:
Do we need new syntax actually? I think that a global unique index
can be created automatically instead of raising an error "unique
constraint on partitioned table must include all partitioning
columns"

I'm not convinced that we want this feature at all: as far as I can
see, it will completely destroy the benefits of making a partitioned
table in the first place. But if we do want it, I don't think it
should be so easy to create a global index by accident as that syntax
approach would make it. I think there needs to be a pretty clear YES
I WANT TO SHOOT MYSELF IN THE FOOT clause in the command.
There are many Oracle users that find global indexes useful despite
their disadvantages.
I have seen this mostly when the goal was to get the benefits of
partition pruning at runtime which turned the full table scan (=Seq Scan)
on huge tables to partition scans on much smaller partitions.
Partition wise joins were also helpful for query performance.
The substantially slower drop partition performance was accepted in those cases.
I think it would be nice to have the option in Postgres as well.
I do agree however, that the global index should not be created automatically.
Something like CREATE GLOBAL [UNIQUE] INDEX ... would be a lot better
Just my 0.05€
On Wed, 23 Nov 2022 at 23:42, Thomas Kellerer <shammat@gmx.net> wrote:
Is it necessary to use a special mark like GLOBAL if this index will be partitioned, and uniqueness will be ensured by repeated evaluations?
Or do you think there should really be a forced single, relation-based index?
I can imagine a unique index on partitions without a special mark, which will be partitioned, and a second variant, a classic index created over the partitioned table, which will be marked as GLOBAL.
Regards
Pavel
Pavel Stehule wrote on 24.11.2022 at 07:03:
My personal opinion is that a global index should never be created automatically.
The user should consciously decide on using a feature
that might have a serious impact on performance in some areas.
On Fri, Nov 18, 2022 at 3:31 AM Cary Huang <cary.huang@highgo.ca> wrote:
Patch: Global Unique Index
- Optimizer, query planning and vacuum -
Since no major modification is done on global unique index's structure and storage, it works in the same way as a regular partitioned index. No major change is required to be done on optimizer, planner and vacuum process as they should work in the same way as regular index.
It might not need changes in vacuum to make it work. But this cannot
really be useful without modifying the way vacuum works. I mean,
currently the indexes are also partitioned based on the table, so
whenever we do a table vacuum it's fine to do an index vacuum, but now
you will have one gigantic index which will be vacuumed every time we
vacuum any of the partitions. So, for example, if you have 10000
partitions, then by the time you vacuum the whole table (all 10000
partitions) the global index will have been vacuumed 10000 times.
There was some effort in the past (though it was not concluded) about
decoupling the index and heap vacuuming such that instead of doing the
index vacuum for each partition we remember the dead tids and we only
do the index vacuum when we think there are enough dead items that
the index vacuum makes sense [1].
[1]: /messages/by-id/CA+TgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 24, 2022 at 07:03:24AM +0100, Pavel Stehule wrote:
I can imagine a unique index on partitions without a special mark, that
will be partitioned,
That exists since v11, as long as the index keys include the partition
keys.
and a second variant classic index created over a partitioned table,
that will be marked as GLOBAL.
That's not what this patch is about, though.
On Thu, Nov 24, 2022 at 08:52:16PM +0530, Dilip Kumar wrote:
but now you will have one gigantic index which will be vacuumed
every time we vacuum any of the partitions.
This patch isn't implemented as "one gigantic index", though.
--
Justin
---- On Thu, 24 Nov 2022 08:00:59 -0700 Thomas Kellerer wrote ---
Pavel Stehule schrieb am 24.11.2022 um 07:03:
There are many Oracle users that find global indexes useful despite
their disadvantages. I have seen this mostly when the goal was to get the benefits of
partition pruning at runtime, which turned the full table scan (= Seq Scan)
on huge tables into partition scans on much smaller partitions.
Partition-wise joins were also helpful for query performance.
The substantially slower drop partition performance was accepted in those cases. I think it would be nice to have the option in Postgres as well.
I do agree however, that the global index should not be created automatically.
Something like CREATE GLOBAL [UNIQUE] INDEX ... would be a lot better
Is it necessary to use special marks like GLOBAL if this index will
be partitioned, and uniqueness will be ensured by repeated
evaluations? Or do you think there should really be one forced
relation-based index? I can imagine a unique index on partitions
without a special mark, that will be partitioned, and a second
variant, a classic index created over a partitioned table, that will
be marked as GLOBAL.

My personal opinion is, that a global index should never be created
automatically. The user should consciously decide on using a feature
that might have a serious impact on performance in some areas.
Agreed, if a unique index is created on non-partition key columns without including the special mark (partition key columns), it may be a mistake by the user. (At least I make this mistake all the time.) Current PG will give you a warning to include the partition keys, which is good.
If we were to automatically turn that into a global unique index, the user may be using the feature without knowing it and experience some performance impact (to account for the extra uniqueness check in all partitions).
On Thu, Nov 24, 2022 at 9:39 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Nov 24, 2022 at 08:52:16PM +0530, Dilip Kumar wrote:
but now you will have one gigantic index which will be vacuumed
every time we vacuum any of the partitions.

This patch isn't implemented as "one gigantic index", though.
If this patch is for supporting a global index then I expect that the
global index across all the partitions is going to be big. Anyway, my
point was about vacuuming the common index every time you vacuum any
of the partitions of the table is not the right way and that will make
global indexes less usable.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 25, 2022 at 8:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Nov 24, 2022 at 9:39 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Nov 24, 2022 at 08:52:16PM +0530, Dilip Kumar wrote:
but now you will have one gigantic index which will be vacuumed
every time we vacuum any of the partitions.

This patch isn't implemented as "one gigantic index", though.
If this patch is for supporting a global index then I expect that the
global index across all the partitions is going to be big. Anyway, my
point was about vacuuming the common index every time you vacuum any
of the partitions of the table is not the right way and that will make
global indexes less usable.
Okay, I got your point. After seeing the details it seems instead of
supporting one common index it is just allowing uniqueness checks
across multiple index partitions. Sorry for the noise.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 21, 2022 at 12:33:30PM +0000, Simon Riggs wrote:
On Thu, 17 Nov 2022 at 22:01, Cary Huang <cary.huang@highgo.ca> wrote:
Patch: Global Unique Index
Let me start by expressing severe doubt on the usefulness of such a
feature, but also salute your efforts to contribute.

In other words, a global unique index and a regular partitioned index are essentially the same in terms of their storage structure, except that one can do a cross-partition uniqueness check and the other cannot.
This is the only workable architecture, since it allows DETACH to be
feasible, which is essential.
I had trouble understanding this feature so I spent some time thinking
about it. I don't think this is really a global unique index, meaning
it is not one index with all the values in it. Rather, it is the
enforcement of uniqueness across all of a partitioned table's indexes.
I think global indexes have a limited enough use-case that this patch's
approach is as close as we are going to get to it in the foreseeable
future.
Second, I outlined the three values of global indexes in this blog
entry, based on a 2019 email thread:
https://momjian.us/main/blogs/pgblog/2020.html#July_1_2020
/messages/by-id/CA+Tgmob_J2M2+QKWrhg2NjQEkMEwZNTfd7a6Ubg34fJuZPkN2g@mail.gmail.com
The three values are:
1. The ability to reference partitioned tables as foreign keys
without requiring the partition key to be part of the foreign
key reference; Postgres 12 allows such foreign keys if they match
partition keys.
2. The ability to add a uniqueness constraint to a partitioned
table where the unique columns are not part of the partition key.
3. The ability to index values that only appear in a few
partitions, and are not part of the partition key.
This patch should help with #1 and #2, but not #3. The uniqueness
guarantee requires, on average, half of the partitioned table's indexes to
be checked if there is a match, and all of the partitioned table's indexes if
there is not. This is because once you find a match, you don't need to keep
checking, because the value is unique.
Looking at the patch, I am unclear how the patch prevents concurrent
duplicate value insertion during the partitioned index checking. I am
actually not sure how that can be done without locking all indexes or
inserting placeholder entries in all indexes. (Yeah, that sounds bad,
unless I am missing something.)
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
Hi Bruce,
Thank you for helping review the patches in such detail.
On 2022-11-25 9:48 a.m., Bruce Momjian wrote:
Looking at the patch, I am unclear how the patch prevents concurrent
duplicate value insertion during the partitioned index checking. I am
actually not sure how that can be done without locking all indexes or
inserting placeholder entries in all indexes. (Yeah, that sounds bad,
unless I am missing something.)
For the uniqueness check across all partitions, we tried to follow the
implementation of the uniqueness check on a single partition, and added a
loop to check uniqueness on the other partitions after the index tuple has
been inserted into the current index partition but before this index tuple has
been made visible. The uniqueness check will wait via `XactLockTableWait` if
there is a valid transaction in progress, and perform the uniqueness
check again after the in-progress transaction finishes.
We tried to simulate this duplicate value case in the steps below:
1) prepare the partitioned table,
CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);
CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10);
CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20);
2) attach gdb to two psql consoles and set breakpoints after the `_bt_doinsert` call,
result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
inside the `btinsert` function in the nbtree.c file.
3) first, execute `INSERT INTO gidx_part values(1, 1, 'test');` on
console-1, and then execute `INSERT INTO gidx_part values(11, 1,
'test');` on console-2 (expect duplicated value '1' in the 2nd column to
be detected),
The test result is: console-2's query has to wait until console-1 either
commits or aborts. If console-1 commits, then console-2
reports that the duplicated value already exists; if console-1 aborts, then
console-2's insert succeeds. If there is a deadlock, the
backend that detects the deadlock errors out, allowing the other one to
continue.
I am not quite sure if this is a proper way to deal with a deadlock in
this case. We would be grateful if someone could provide some
cases/methods to verify this cross-partition uniqueness check.
Best regards,
David
============================
HighGo Software Canada
www.highgo.ca
On Fri, Nov 25, 2022 at 05:03:06PM -0800, David Zhang wrote:
Hi Bruce,
Thank you for helping review the patches in such detail.
On 2022-11-25 9:48 a.m., Bruce Momjian wrote:
Looking at the patch, I am unclear how the patch prevents concurrent
duplicate value insertion during the partitioned index checking. I am
actually not sure how that can be done without locking all indexes or
inserting placeholder entries in all indexes. (Yeah, that sounds bad,
unless I am missing something.)

For the uniqueness check across all partitions, we tried to follow the
implementation of uniqueness check on a single partition, and added a loop to
check uniqueness on other partitions after the index tuple has been inserted to
current index partition but before this index tuple has been made visible. The
uniqueness check will wait `XactLockTableWait` if there is a valid transaction
in process, and performs the uniqueness check again after the in-process
transaction finished.
I can't see why this wouldn't work, but I also can't think of any cases
where we do this in our code already, so it will need careful
consideration.
We kind of do this for UPDATE and unique key conflicts, but only for a
single index entry, where we peek and sleep on pending changes, but not
across indexes.
I am not quite sure if this is a proper way to deal with a deadlock in this
case. We would be grateful if someone could provide some cases/methods
to verify this cross-partition uniqueness check.
I assume you are using our existing deadlock detection code, and just
sleeping in various indexes and expecting deadlock detection to happen.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On Fri, Nov 18, 2022 at 12:03:53PM +0300, Sergei Kornilov wrote:
Hello
Do we need new syntax actually? I think that a global unique index can be created automatically instead of raising an error "unique constraint on partitioned table must include all partitioning columns"
I may suggest even more of the new syntax.
If someone has to implement sequential index checking on unique
constraints, then it would be useful to be able to do that independent
of partitioning also.
E.g. for some kinds of manual partitions or for strangely designed
datasets. Or for some of the table partitions instead of for
all of them.
For that reason, perhaps some other type of unique index -- one that
is not an index per se, but a check against a set of indexes --
could be added. Or, perhaps, not an index, but an EXCLUDE constraint
of that kind.
On 11/24/22 19:15, Cary Huang wrote:
---- On Thu, 24 Nov 2022 08:00:59 -0700 Thomas Kellerer wrote ---
Pavel Stehule schrieb am 24.11.2022 um 07:03:
There are many Oracle users that find global indexes useful despite
their disadvantages. I have seen this mostly when the goal was to get the benefits of
partition pruning at runtime, which turned the full table scan (= Seq Scan)
on huge tables into partition scans on much smaller partitions.
Partition-wise joins were also helpful for query performance.
The substantially slower drop partition performance was accepted in those cases. I think it would be nice to have the option in Postgres as well.
I do agree however, that the global index should not be created automatically.
Something like CREATE GLOBAL [UNIQUE] INDEX ... would be a lot better
Is it necessary to use special marks like GLOBAL if this index will
be partitioned, and uniqueness will be ensured by repeated
evaluations? Or do you think there should really be one forced
relation-based index? I can imagine a unique index on partitions
without a special mark, that will be partitioned, and a second
variant, a classic index created over a partitioned table, that will
be marked as GLOBAL.

My personal opinion is, that a global index should never be created
automatically. The user should consciously decide on using a feature
that might have a serious impact on performance in some areas.

Agreed, if a unique index is created on non-partition key columns without including the special mark (partition key columns), it may be a mistake by the user. (At least I make this mistake all the time.) Current PG will give you a warning to include the partition keys, which is good.
If we were to automatically turn that into a global unique index, the user may be using the feature without knowing it and experience some performance impact (to account for the extra uniqueness check in all partitions).
I disagree. A user does not need to know that a table is partitionned,
and if the user wants a unique constraint on the table then making them
type an extra word to get it is just annoying.
--
Vik Fearing
On Tue, 2022-11-29 at 13:58 +0100, Vik Fearing wrote:
I disagree. A user does not need to know that a table is partitionned,
and if the user wants a unique constraint on the table then making them
type an extra word to get it is just annoying.
Hmm. But if I created a primary key without thinking too hard about it,
only to discover later that dropping old partitions has become a problem,
I would not be too happy either.
Yours,
Laurenz Albe
On Fri, 25 Nov 2022 at 20:03, David Zhang <david.zhang@highgo.ca> wrote:
Hi Bruce,
Thank you for helping review the patches in such detail.
On 2022-11-25 9:48 a.m., Bruce Momjian wrote:
Looking at the patch, I am unclear how the patch prevents concurrent
duplicate value insertion during the partitioned index checking. I am
actually not sure how that can be done without locking all indexes or
inserting placeholder entries in all indexes. (Yeah, that sounds bad,
unless I am missing something.)

For the uniqueness check across all partitions, we tried to follow the implementation of the uniqueness check on a single partition, and added a loop to check uniqueness on the other partitions after the index tuple has been inserted into the current index partition but before this index tuple has been made visible. The uniqueness check will wait via `XactLockTableWait` if there is a valid transaction in progress, and perform the uniqueness check again after the in-progress transaction finishes.
I think this is the key issue to discuss. The rest is all UX
bikeshedding (which is pretty important in this case) but this is the
core uniqueness implementation.
If I understand correctly you're going to insert into the local index
for the partition using the normal btree uniqueness implementation.
Then while holding an exclusive lock on the index do lookups on every
partition for the new key. Effectively serializing inserts to the
table?
I think the precedent here are "exclusion constraints" which are
documented in two places in the manual:
https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-EXCLUSION
https://www.postgresql.org/docs/current/sql-createtable.html#SQL-CREATETABLE-EXCLUDE
These also work by doing lookups for violating entries and don't
depend on any special index machinery like btree uniqueness. But I
don't think they need to entirely serialize inserts either so it may
be worth trying to figure out how they manage this to avoid imposing
that overhead.
There's a comment in src/backend/executor/execIndexing.c near the top
about them but I'm not sure it covers all the magic needed for them to
work...
--
greg
Greg Stark <stark@mit.edu> writes:
If I understand correctly you're going to insert into the local index
for the partition using the normal btree uniqueness implementation.
Then while holding an exclusive lock on the index do lookups on every
partition for the new key. Effectively serializing inserts to the
table?
... not to mention creating a high probability of deadlocks between
concurrent insertions to different partitions. If they each
ex-lock their own partition's index before starting to look into
other partitions' indexes, it seems like a certainty that such
cases would fail. The rule of thumb about locking multiple objects
is that all comers had better do it in the same order, and this
isn't doing that.
That specific issue could perhaps be fixed by having everybody
examine all the indexes in the same order, inserting when you
come to your own partition's index and otherwise just checking
for conflicts. But that still means serializing insertions
across all the partitions. And the fact that you need to lock
all the partitions, or even just know what they all are, is
going to play hob with a lot of assumptions we've made about
different partitions being independent, and about what locks
are needed for operations like ALTER TABLE ATTACH PARTITION.
(I wonder BTW what the game plan is for attaching a partition
to a partitioned table having a global index. Won't that mean
having to check every row in the new partition against every
one of the existing partitions? So much for ATTACH being fast.)
I still think this is a dead end that will never get committed.
If folks want to put time into perhaps finding an ingenious
way around these problems, okay; but they'd better realize that
there's a high probability of failure, or at least coming out
with something nobody will want to use.
regards, tom lane
On Tue, Nov 29, 2022 at 06:13:56PM -0500, Tom Lane wrote:
Greg Stark <stark@mit.edu> writes:
If I understand correctly you're going to insert into the local index
for the partition using the normal btree uniqueness implementation.
Then while holding an exclusive lock on the index do lookups on every
partition for the new key. Effectively serializing inserts to the
table?

... not to mention creating a high probability of deadlocks between
concurrent insertions to different partitions. If they each
ex-lock their own partition's index before starting to look into
other partitions' indexes, it seems like a certainty that such
cases would fail. The rule of thumb about locking multiple objects
is that all comers had better do it in the same order, and this
isn't doing that.
I am not sure why they would need to exclusive lock anything more than
the unique index entry they are adding, just like UPDATE does.
I still think this is a dead end that will never get committed.
If folks want to put time into perhaps finding an ingenious
way around these problems, okay; but they'd better realize that
there's a high probability of failure, or at least coming out
with something nobody will want to use.
Agreed, my earlier point was that this would need a lot of thought to
get right since we don't do this often. The exclusion constraint is a
close example, though that is in a single index.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Nov 29, 2022 at 06:13:56PM -0500, Tom Lane wrote:
... not to mention creating a high probability of deadlocks between
concurrent insertions to different partitions. If they each
ex-lock their own partition's index before starting to look into
other partitions' indexes, it seems like a certainty that such
cases would fail. The rule of thumb about locking multiple objects
is that all comers had better do it in the same order, and this
isn't doing that.
I am not sure why they would need to exclusive lock anything more than
the unique index entry they are adding, just like UPDATE does.
Assuming that you are inserting into index X, and you've checked
index Y to find that it has no conflicts, what prevents another
backend from inserting a conflict into index Y just after you look?
AIUI the idea is to prevent that by continuing to hold an exclusive
lock on the whole index Y until you've completed the insertion.
Perhaps there's a better way to do that, but it's not what was
described.
I actually think that that problem should be soluble with a
slightly different approach. The thing that feels insoluble
is that you can't do this without acquiring sufficient locks
to prevent addition of new partitions while the insertion is
in progress. That will be expensive in itself, and it will
turn ATTACH PARTITION into a performance disaster.
regards, tom lane
On Tue, Nov 29, 2022 at 09:16:23PM -0500, Tom Lane wrote:
Assuming that you are inserting into index X, and you've checked
index Y to find that it has no conflicts, what prevents another
backend from inserting a conflict into index Y just after you look?
AIUI the idea is to prevent that by continuing to hold an exclusive
lock on the whole index Y until you've completed the insertion.
Perhaps there's a better way to do that, but it's not what was
described.
As I understood it, you insert into index X and then scan all other
indexes to look for a conflict --- if you find one, you abort with a
unique index conflict. Other index changes do the same.
So, for example, one session inserts into index X and then scans all
other indexes. During the index scan, another session inserts into
index Y, but its scan sees the index X addition and gets a uniqueness
conflict error.
I actually think that that problem should be soluble with a
slightly different approach. The thing that feels insoluble
is that you can't do this without acquiring sufficient locks
to prevent addition of new partitions while the insertion is
in progress. That will be expensive in itself, and it will
turn ATTACH PARTITION into a performance disaster.
Yes, that would require index locks.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Embrace your flaws. They make you human, rather than perfect,
which you will never be.
On 11/29/22 17:29, Laurenz Albe wrote:
On Tue, 2022-11-29 at 13:58 +0100, Vik Fearing wrote:
I disagree. A user does not need to know that a table is partitionned,
and if the user wants a unique constraint on the table then making them
type an extra word to get it is just annoying.

Hmm. But if I created a primary key without thinking too hard about it,
only to discover later that dropping old partitions has become a problem,
I would not be too happy either.
I have not looked at this patch, but my understanding of its design is
the "global" part of the index just makes sure to check a unique index
on each partition. I don't see from that how dropping old partitions
would be a problem.
--
Vik Fearing
On Wed, 2022-11-30 at 10:09 +0100, Vik Fearing wrote:
On 11/29/22 17:29, Laurenz Albe wrote:
On Tue, 2022-11-29 at 13:58 +0100, Vik Fearing wrote:
I disagree. A user does not need to know that a table is partitionned,
and if the user wants a unique constraint on the table then making them
type an extra word to get it is just annoying.

Hmm. But if I created a primary key without thinking too hard about it,
only to discover later that dropping old partitions has become a problem,
I would not be too happy either.

I have not looked at this patch, but my understanding of its design is
the "global" part of the index just makes sure to check a unique index
on each partition. I don't see from that how dropping old partitions
would be a problem.
Right, I should have looked closer. But, according to the parallel discussion,
ATTACH PARTITION might be a problem. A global index is likely to be a footgun
one way or the other, so I think it should at least have a safety on
(CREATE PARTITIONED GLOBAL INDEX or something).
Yours,
Laurenz Albe
On Tue, 29 Nov 2022 at 21:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I actually think that that problem should be soluble with a
slightly different approach. The thing that feels insoluble
is that you can't do this without acquiring sufficient locks
to prevent addition of new partitions while the insertion is
in progress. That will be expensive in itself, and it will
turn ATTACH PARTITION into a performance disaster.
I think there's a lot of room to manoeuvre here. This is a new feature
that doesn't need to be 100% complete or satisfy any existing
standard. There are lots of options for compromises that leave room
for future improvements.
1) We could just say: sure, ATTACH is slow if you're attaching a
non-empty partition
2) We could invent a concept like convalidated and let people attach a
partition without validating the uniqueness and then validate it later
concurrently
3) We could say ATTACH doesn't work now and come up with a better
strategy in the future
Also, don't I vaguely recall something in exclusion constraints about
having some kind of in-memory "intent" list where you declared that
you're about to insert a value, you validate it doesn't violate the
constraint and then you're free to insert it because anyone else will
see your intent in memory? There might be a need for some kind of
global object that only holds inserted keys long enough that other
sessions are guaranteed to see the key in the correct index. And that
could maybe even be in memory rather than on disk.
This isn't a simple project but I don't think it's impossible as long
as we keep an open mind about the requirements.
--
greg
Thanks a lot for all the comments.
On 2022-11-29 3:13 p.m., Tom Lane wrote:
... not to mention creating a high probability of deadlocks between
concurrent insertions to different partitions. If they each
ex-lock their own partition's index before starting to look into
other partitions' indexes, it seems like a certainty that such
cases would fail. The rule of thumb about locking multiple objects
is that all comers had better do it in the same order, and this
isn't doing that.
In the current POC patch, the deadlock happens when backend-1
inserts a value into index X (partition-1), and backend-2 tries to insert a
conflicting value right after backend-1 has released the buffer block lock but
before it starts the uniqueness check on index Y (partition-2). In this case,
backend-1 holds ExclusiveLock on transaction-1 and waits for ShareLock
on transaction-2, while backend-2 holds ExclusiveLock on transaction-2
and waits for ShareLock on transaction-1. Based on my debugging tests,
this only happens when backend-1 and backend-2 want to insert conflicting
values. If this is true, then is it OK to error out with either `deadlock` or
`duplicated value`, since this is a conflicting value anyway? (Hopefully
end users can handle both in a similar way.) I think such a deadlock
requires two conditions: 1) users insert a conflicting value, and
2) the uniqueness checking happens at the right moment (see above).
That specific issue could perhaps be fixed by having everybody
examine all the indexes in the same order, inserting when you
come to your own partition's index and otherwise just checking
for conflicts. But that still means serializing insertions
across all the partitions. And the fact that you need to lock
all the partitions, or even just know what they all are,
Here is the main change for the insertion cross-partition uniqueness check
in `0004-support-global-unique-index-insert-and-update.patch`:

      result = _bt_doinsert(rel, itup, checkUnique, indexUnchanged, heapRel);
    + if (checkUnique != UNIQUE_CHECK_NO)
    +     btinsert_check_unique_gi(itup, rel, heapRel, checkUnique);
    +
      pfree(itup);
where a cross-partition uniqueness check is added after the index tuple's
btree insertion into the current partition. The idea is to make sure other
backends can find the index tuple just inserted (before it is
marked as visible), and the uniqueness check on the current partition can
be skipped, as it has already been done. Based on this change, I think
insertion serialization can happen in two cases: 1) two insertions
happen on the same buffer block (buffer lock waiting); 2) two ongoing
insertions with duplicated values (transaction id waiting).
On 2022-11-29 6:16 p.m., Tom Lane wrote:
Assuming that you are inserting into index X, and you've checked
index Y to find that it has no conflicts, what prevents another
backend from inserting a conflict into index Y just after you look?
AIUI the idea is to prevent that by continuing to hold an exclusive
lock on the whole index Y until you've completed the insertion.
Perhaps there's a better way to do that, but it's not what was
described.
Another main change in patch
`0004-support-global-unique-index-insert-and-update.patch`:

    + search_global:
    +     stack = _bt_search(iRel, insertstate.itup_key,
    +                        &insertstate.buf, BT_READ, NULL);
    +     xwait = _bt_check_unique_gi(iRel, &insertstate,
    +                                 hRel, checkUnique, &is_unique,
    +                                 &speculativeToken, heapRel);
    +     if (unlikely(TransactionIdIsValid(xwait)))
    +     {
              ... ...
    +         goto search_global;
    +     }
Here, I am trying to use `BT_READ` to acquire an LW_SHARED lock on the
buffer block if a match is found using the `itup_key` search key. The
cross-partition uniqueness check will wait if the index tuple
insertion on this buffer block has not finished yet, and otherwise runs the
uniqueness check to see if there is an ongoing transaction which may
insert a conflicting value. Once the ongoing insertion is done, it goes
back and checks again (I think it can also handle the case where a
potentially conflicting index tuple is later marked as dead in the same
transaction). Based on this change, my test results are:
1) a select-only query will not be blocked by an ongoing insertion on
index X
2) an insertion happening on index Y may wait for the buffer block lock
when inserting a different value, but it does not wait for the
transaction lock held by the insertion on index X.
3) when an insertion inserts a conflicting value on index Y,
3.1) it waits for the buffer block lock if the lock is held by
the insertion on index X.
3.2) then, it waits for the transaction lock until the insertion on
index X is done.
Hi!
Sorry to bother - but is this patch used in IvorySQL?
Here:
https://www.ivorysql.org/docs/Global%20Unique%20Index/create_global_unique_index
According to syntax it definitely looks like this patch.
Thank you!
--
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/
On 2022-12-19 7:51 a.m., Nikita Malakhov wrote:
Sorry to bother - but is this patch used in IvorySQL?
Here:
https://www.ivorysql.org/docs/Global%20Unique%20Index/create_global_unique_index
According to the syntax it definitely looks like this patch.
The global unique index is one of the features required in IvorySQL
development. We want to share it with the community to get more
feedback, and then hopefully we can better contribute it back to
PostgreSQL.
Best regards,
David
On 2022-11-29 6:16 p.m., Tom Lane wrote:
Assuming that you are inserting into index X, and you've checked
index Y to find that it has no conflicts, what prevents another
backend from inserting a conflict into index Y just after you look?
AIUI the idea is to prevent that by continuing to hold an exclusive
lock on the whole index Y until you've completed the insertion.
Perhaps there's a better way to do that, but it's not what was
described.
During inserts, the global unique index patch does not acquire an
exclusive lock on the whole index Y while checking it for uniqueness;
it acquires a lower-level AccessShareLock on Y and releases it after
checking. So while it is checking, another backend can still insert a
duplicate into index Y. If that happens, a "transaction level lock"
wait is triggered.
For example, say backend A inserts into index X and checks index Y,
finding no conflict, and backend B inserts a conflicting value into
index Y right after. In this case, backend B still has to check index X
for conflicts, and it will fetch the duplicate tuple that A has
inserted, but it cannot declare a duplicate error yet, because the
transaction in which backend A inserted that tuple is still in
progress. At this moment, backend B has to wait for backend A to
commit or abort before it can continue. This is how the "transaction
level lock" prevents concurrent insert conflicts.
There is a chance of deadlock if the conflicting insertions by A and B
happen at roughly the same time, where both backends take the
"transaction level lock" and wait for each other to commit/abort. In
that case, PG's deadlock detection code will error out one of the
backends. This should be okay, because it means one of the backends
tried to insert a duplicate; the purpose of a global unique index is
precisely to error out backends that insert duplicates. In the end the
effect is the same, it is just that the error says "deadlock detected"
instead of "duplicate detected".
If backend B did not insert a conflicting tuple, no transaction lock
wait will be triggered, and therefore no deadlock will happen.
Regards
Cary Huang
-----------------------
HighGo Software Canada
On 2022-11-30 2:30 p.m., Greg Stark wrote:
On Tue, 29 Nov 2022 at 21:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I actually think that that problem should be soluble with a
slightly different approach. The thing that feels insoluble
is that you can't do this without acquiring sufficient locks
to prevent addition of new partitions while the insertion is
in progress. That will be expensive in itself, and it will
turn ATTACH PARTITION into a performance disaster.
I think there's a lot of room to manoeuvre here. This is a new feature
that doesn't need to be 100% complete or satisfy any existing
standard. There are lots of options for compromises that leave room
for future improvements.
1) We could just say sure ATTACH is slow if you're attaching a
non-empty partition
2) We could invent a concept like convalidated and let people attach a
partition without validating the uniqueness and then validate it later
concurrently
3) We could say ATTACH doesn't work now and come up with a better
strategy in the future.
Also, don't I vaguely recall something in exclusion constraints about
having some kind of in-memory "intent" list where you declared that
you're about to insert a value, you validate it doesn't violate the
constraint and then you're free to insert it because anyone else will
see your intent in memory? There might be a need for some kind of
global object that only holds inserted keys long enough that other
sessions are guaranteed to see the key in the correct index. And that
could maybe even be in memory rather than on disk.
This isn't a simple project but I don't think it's impossible as long
as we keep an open mind about the requirements.
In the current global unique index implementation, ATTACH can be slow
if there are concurrent inserts happening. ATTACH tries to acquire a
ShareLock on all existing partitions and the partition-to-be before it
scans and sorts them for the uniqueness check, and it releases the
locks only after all partitions have been checked. If there are
concurrent inserts, ATTACH has to wait for all inserts to complete.
Likewise, if ATTACH is in progress, inserts have to wait as well. This
is one issue.
If we were to make ATTACH acquire a lower-level lock (AccessShareLock),
scan a partition, and then release it, there would be nothing stopping
concurrent inserts from inserting a conflicting value right after it
finishes checking. This is another issue: no transaction level lock is
triggered here, unlike in the multiple-concurrent-inserts case.
Another email thread, "create index concurrently on partitioned
index", discusses some approaches that may be used to solve the ATTACH
issue here, basically to allow ATTACH PARTITION CONCURRENTLY...
regards
Cary Huang
---------------------------------
HighGo Software Canada
Hi!
Please advise on the status of this patch set - are there any improvements?
Is there any work going on?
Thanks!
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/